Abstract: Large-scale multiprocessing remains an elusive, yet promising paradigm for achieving 
high-performance computation. As machine size scales upward, there are two important aspects 
of multiprocessor systems which will generally get worse rather than better: (1) interprocessor 
communication latency will increase and (2) the probability that some component in the system 
will fail will increase. Both of these problems can prevent us from realizing the potential benefits of 
large-scale multiprocessing. In this document we consider the problem of designing networks which 
simultaneously minimize communication latency while maximizing fault tolerance for large-scale 
multiprocessors. Using a synergy of techniques including connection topologies, routing protocols, 
signalling techniques, and packaging technologies we assemble integrated, system-level solutions 
to this network design problem. In particular, we recommend the use of multipath, multistage 
networks, simple, source-responsible routing protocols, stochastic fault-avoidance, dense 
three-dimensional packaging, low-voltage, series-terminated transmission line signalling, and 
scan-based diagnostics and reconfiguration. 
Part I 



Introduction and Background 



1. Introduction 



The high capabilities and low costs of modern microprocessors have made it attractive from both 
economic and performance viewpoints to design and construct large-scale multiprocessors based 
on commodity processor technologies. Nonetheless, many challenges remain to effectively realize 
the potential performance promised by large-scale multiprocessing on a wide range of applications. 
One key challenge is to provide sufficient inter-processor communication performance to allow 
efficient multiprocessor operation - and to provide such performance at a reasonable cost. 

In order for processors to work effectively together in a computation, they must be able to 
communicate data with each other in a timely fashion. The exact nature and role of communication 
varies with the particular programming model, but the need is pervasive. Virtually all paradigms for 
parallel processing depend critically on low communication latency to effectively exploit parallel 
execution to reduce total execution time. Communication latency is a critical determinant of the 
amount of exploitable parallelism and the cost of synchronization. For shared-memory algorithms, 
latency affects the speed of cache-replacement and coherency operations. In message-passing 
programs, latency affects the delay between message transmission and reception. In dataflow 
programs, latency determines the delay between the computation of a data value and the time when 
the value can actually be used. Data parallel operations are limited by the rate at which processors 
can obtain access to the data on which they need to operate. 

Multithreaded ([Smi78] [Jor83] [ALKK90] [SBCvE90] [CSS+91] [NPA92]) and dataflow 
([ACM88] [AI87] [PC90]) architectures have been developed to mitigate communication latency 
by hiding its effects. These techniques all rely on an abundance of parallelism to provide useful 
processing to perform while waiting on slow communications. The limit to the usable parallelism, 
then, is determined by the nature of the problem and the algorithm used to solve it, the rate of 
computation on each processor, and the communication latency. Our challenge today is to provide 
sufficiently low-latency communications to match the computation rate provided by commodity 
processors while allowing the most effective use of the parallelism inherent in each problem. 

Regardless of the exact network topology used for communications, both the number of switch- 
ing components and the amount of wiring inside the network are at least linear in the number of 
processors supported by the network. The single component failure rate is also linear in the network 
size. If we do not engineer the network to operate properly when faults exist, the acceptable failure 
rate for any system will directly fix a ceiling on the maximum machine size. To avoid this ceiling 
we consider network designs which can operate properly in the presence of faults. 
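
The ceiling can be illustrated with a short sketch. It assumes, purely for illustration, that components fail independently with a single per-component failure probability; the function names and the numbers used are hypothetical and are not taken from this work.

```python
import math


def prob_fault_free(n_components, p_fail):
    """Probability that none of n independent components is faulty."""
    return (1.0 - p_fail) ** n_components


def max_components(p_fail, required):
    """Largest component count whose fault-free probability is still >= required."""
    return math.floor(math.log(required) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    # Hypothetical numbers: 0.1% failure probability per component, and a
    # requirement that the whole machine be fault free 90% of the time.
    limit = max_components(0.001, 0.9)
    print(limit, prob_fault_free(limit, 0.001))
```

Since the component count grows at least linearly with the number of processors, a network that must be entirely fault free to operate caps the machine size in this way, whatever the topology.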

In this document, we examine a class of processor interconnection networks which are designed 
to simultaneously minimize network latency while maximizing fault tolerance. A combination of 
organizational techniques, protocols, circuit techniques, and packaging technologies is employed 
to realize a class of integrated solutions to these problems. 



1.1 Goals 

Our goals in designing a high-performance network for large-scale multiprocessing are to 
optimize for: 

• Low Latency 

• High Bandwidth 

• High Reliability 

• Testability/Repairability 

• Scalability 

• Flexibility/Versatility 

• Reasonable Cost 

• Practical Implementation 

As suggested above and developed further in Sections 2.3 and 2.5, latency and reliability are key 
properties which must be considered when designing a large-scale, high-performance multipro- 
cessor network. Insufficient bandwidth will have a detrimental impact on latency (Section 2.4). 
Fault diagnosis and repair are key to limiting the impact of any faults in the network (Section 2.6). 
Scalability is important to maximize the period over which the solutions remain 
effective. Flexibility in the solutions allows the class of networks to remain applicable across a wide 
range of specific needs (Section 2.8). 

1.2 Scope 

This work only attempts to address issues directly related to the network for a large-scale 
multiprocessor. Attention is paid to providing efficient and robust interfaces between processing 
nodes and the network. Attention is also given to how the node interacts with the network. However, 
the fault-tolerance schemes presented here do not guard against failures of the processing nodes or 
in the memory system. The scheme detailed here may be suitable as a reliable network substrate 
for future work in processor and memory fault recovery. 

1.3 Overview 

In this section, we provide a quick overview of the network design at several levels. This section 
should give the reader a basic picture of the class of networks and technologies being considered. 
Part II develops everything introduced here in detail. 




A multibutterfly style interconnection network constructed from 4 x 2 (inputs x radix) 
dilation-2 crossbars and 2 x 2 dilation-1 crossbars. Each of the 16 endpoints has two inputs 
and outputs for fault tolerance. Similarly, the routers each have two outputs in each of 
their two logical output directions. As a result, there are many paths between each pair of 
network endpoints. Paths between endpoint 6 and endpoint 16 are shown in bold. 



Figure 1.1: 16 x 16 Multibutterfly Network 



1.3.1 Topology 

A suitable network topology is the first essential ingredient to producing a reliable, high- 
performance network. The network topology will ultimately dictate: 

• Switching Latency - the number of switches, and to some extent the length of the wires, 
which must be traversed between nodes in the network 

• Underlying Reliability - the redundancy available to make fault-tolerant operation possible 

• Scalability - the characteristic growth of resource requirements with system size 

• Versatility - the extent to which the network can be adapted to a wide range of applications. 

To simultaneously optimize these characteristics, we utilize multipath, multistage interconnec- 
tion networks based on several key ideas from the theoretical community including multibutterflies 
[Upf89] [LM92] and fat trees [Lei85]. 

Using multibutterfly (See Figure 1.1) and fat-tree networks (See Figure 1.2), we minimize the 
number of routing switches which must be traversed in the network between any pair of nodes. 
Using bounded degree routing nodes, the least possible number of switches between endpoints 
is logarithmic in the size of the network, a lower bound which these networks achieve. For 
small machine configurations the multibutterfly networks achieve the logarithmic lower bound 
with a multiplicative constant of one (e.g. routing switches traversed = $\log_r N$, where N is 




Figure 1.2: Area-Universal Fat-Tree with Constant Size Switches (Greenberg and Leiserson) 



the number of processing nodes in the network and r is the radix of the routing component 
used for switching). For larger machine configurations, fat trees provide lower latency for local 
communication. Applications can take advantage of the locality inherent in the fat-tree topology to 
realize lower average communication latencies. To further minimize switching latency, our fat-tree 
networks make use of short-cut paths, keeping the worst-case switching latency down to | log 4 N 
when using radix-four routing components. 
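
As a minimal sketch of the logarithmic bound for the multibutterfly case (multiplicative constant of one), the helper below counts the routing stages needed to reach N endpoints with radix-r routers. The function name is hypothetical, and the calculation ignores dilation, which adds redundant paths but not stages.

```python
def switch_stages(n_nodes, radix):
    """Smallest number of radix-r stages s such that radix**s >= n_nodes."""
    stages, reach = 0, 1
    while reach < n_nodes:
        reach *= radix      # each stage multiplies the reachable endpoints by r
        stages += 1
    return stages


if __name__ == "__main__":
    # 16 endpoints with radix-2 routers, and 64 or 1024 endpoints with radix-4 routers.
    for n, r in [(16, 2), (64, 4), (1024, 4)]:
        print(n, r, switch_stages(n, r))
```

Integer arithmetic is used rather than a floating-point logarithm so that exact powers of the radix are not rounded the wrong way.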

The multipath nature of these routing networks provides a basis for fault-tolerant operation, as 
well as providing high bandwidth operation. The multipath networks provide multiple, redundant 
paths between every pair of processing nodes. The alternative paths are also available for min- 
imizing congestion within the network, resulting in increased effective bandwidth and decreased 
effective latency. When faults occur, the availability of alternative paths between endpoints makes 
it possible to route around faulty components in the network. 

A high degree of scalability is achieved by using fat-tree organizations for large networks. The 
scalable properties of fat trees allow construction of arbitrarily large machines using the same basic 
network architecture. When organized properly, these large fat trees can be shown to minimize 
the total length of time that any message spends traversing wires within the routing network as 
compared to any other network. The hardware resources required for the fat-tree network grow 
linearly in the number of processors supported. 

Further, these networks provide considerable versatility allowing them to be adapted to meet 
the specific needs of a particular application. By selecting the number of network ports into each 
processing node, we can customize the bandwidth and reliability within the network to meet the 
needs of the application. By controlling the width of the basic data channel, we can provide 
varying amounts of latency and bandwidth into a node. This flexibility makes it possible to use 
the same basic network solutions across a broad range of machines from low-cost workstations to 
high-bandwidth supercomputers by selecting the network parameters appropriately. 

1.3.2 Routing 

While a good network topology is necessary for reliable, high-performance communications, 
it is by no means sufficient. We must also have a routing scheme capable of efficiently exploiting 
the features of the network. In developing a routing strategy for use with multiprocessor commu- 
nications networks, we focussed on achieving a routing framework with the following properties: 

1. Low-overhead routing - Low-overhead routing attempts to minimize the fraction of poten- 
tial bandwidth consumed by protocol overhead and similarly minimize the latency associated 
with protocol processing. 

2. Fault identification and localization with minimal overhead - To achieve fault tolerance, 
we must be able to detect when faults corrupt data in our system. Further, to minimize the 
impact of faults on system performance, we must be able to efficiently identify the source of 
any faults in the system. 

3. Flexible protocol - To be suitable for use in a wide range of applications and environments, 
the protocol must be flexible allowing efficient layering of the required data transfer on top 
of the underlying communications. 

4. Dynamic fault tolerance - For the network to scale robustly to very large implementations, 
it is critical that the network and routing components continue to operate properly as new 
faults arise in the system. 

5. Distributed routing - In order to avoid single points of failure in the system, routing must 
proceed in a distributed fashion, requiring the correct operation of no central resources. 

To this end, we have developed the METRO Routing Protocol, MRP, a simple, reliable, source- 
responsible router protocol suitable for use with multipath networks. MRP provides half-duplex, 
bidirectional data transmission over pipelined, circuit-switched routing channels. The simple pro- 
tocol coupled with pipelined routing allows for high-bandwidth, low-latency implementations. The 
circuit-switched nature avoids the issues associated with buffering inside the network. Each routing 
component makes local routing decisions among equivalent outputs based on channel utilization, 
using randomization to choose among equivalent alternatives. Routing components further provide 
connection information and checksums back to the source node to allow error localization within 
the network. When errors or blocking occur, the source can retry data transmission. The 
randomization in path selection guarantees that any existing non-faulty path can eventually be found 
without global information. 
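
The interaction between randomized local decisions and source-responsible retry can be sketched as follows. This is a toy model under stated assumptions, not the MRP implementation: the network is a hypothetical acyclic successor map in which every listed output leads toward the destination, faults are a set of dead routers, and channel utilization, checksums, and pipelining are all ignored.

```python
import random


def route_once(successors, faulty, source, dest, rng):
    """Try to open one connection; return the path, or None if a fault is hit."""
    node, path = source, [source]
    while node != dest:
        # Local decision: pick at random among the equivalent outputs.
        node = rng.choice(successors[node])
        if node in faulty:
            return None
        path.append(node)
    return path


def send(successors, faulty, source, dest, rng, max_tries=100):
    """Source-responsible retry: keep trying until some fault-free path is found."""
    for attempt in range(1, max_tries + 1):
        path = route_once(successors, faulty, source, dest, rng)
        if path is not None:
            return path, attempt
    return None, max_tries


if __name__ == "__main__":
    # Two redundant routers "a" and "b" sit between source and destination; "a" is faulty.
    successors = {"src": ["a", "b"], "a": ["dst"], "b": ["dst"]}
    print(send(successors, {"a"}, "src", "dst", random.Random(0)))
```

No router in the sketch knows where the fault is; the source simply retries, and each retry is independently likely to take a different path, so a surviving path is found with probability approaching one as the number of attempts grows.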

Figure 1.3: Cross-Section of Stack Packaging (Diagram courtesy of Fred Drenckhahn) 

1.3.3 Technology 

Regardless of the advances we make in topology and routing, the ultimate performance of an 
implementation is limited by the implementation technology. Packaging density constrains the 
minimum lengths for interconnect and hence the minimum latency between routing components 
and nodes. Once our interconnection distances are fixed, data transmission latency is limited by 
the time taken to traverse the interconnect and to traverse component i/o pads. 



Packaging 

Our goal in packaging these networks is to minimize the interconnection distances between 
components. At the same time, we aim to utilize economical technologies and provide efficient 
cooling and repair of densely packaged components. The basic packaging unit is a three-dimensional 
stack of components and printed-circuit boards (See Figure 1.3). Computational, memory, and 
routing components are housed in dual-sided land-grid arrays and sandwiched between layers of 
conventional PCBs. The land-grid arrays, with pads on both sides of the package, serve to both 
house VLSI components and provide vertical interconnect in the stack structure. Button boards 
are used to provide reliable, solderless connection between land-grid array packages and adjacent 
PCBs. The land-grid array and button board packages provide channels for coolant flow. The 
composite stack structure is compatible with both air and liquid cooling. The stack structure 
provides the necessary dense interconnection in all three physical dimensions allowing for minimal 
wiring distances between components. Using this technology, we can package an entire 64-node 
multiprocessor including the network and nodes in roughly 1' x 1' x 5". 

Signalling 

To minimize wire transit and component i/o time, we utilize series-terminated, matched- 
impedance, point-to-point transmission line signalling. Further, to reduce power consumption 
the i/o structures use low-voltage signal swings. By integrating a series-terminated transmission 
line driver into the i/o pads, we avoid the need to wait for reflections to settle on the PCB traces 
without requiring additional external components. The low-voltage, series-terminated drivers can 
switch much faster than conventional 5V-swing drivers. Initial experience with this technology 
indicates we can drive a signal through an output pad, across 30 cm of wire, and into an input pad 
in less than 5 ns. 

1.3.4 Fault Management 

Performance in the presence of faulty components and wires can be further improved by hiding 
the effects of faulty components. Using some novel, fault-tolerant additions to baseline IEEE 
1149.1-1990 JTAG scan functionality, we can realize an effective scan-based testing strategy. By 
configuring components with multiple test-access ports, the architecture is resilient to faults in the 
test system itself. With port-by-port deselection and scan capabilities, it is possible to diagnose 
potentially faulty network components online, i.e., while the rest of the system remains fully 
operational. Furthermore, these facilities allow faulty wires and components to be configured out 
of the system so that they do not degrade system performance. Once faults are localized using boundary 
scan, the system can log faulty components for later repair and make an accurate assessment of the 
system integrity. For larger systems, these facilities allow online replacement of faulty subsystems. 

1.4 Organization 

Before developing strategies for addressing these problems, Chapter 2 develops the problems 
and issues in further detail. Part II takes a detailed look at the key components of robust, low-latency 
networks. Chapter 3 leads off by examining the network topology. Chapter 4 addresses the issue 
of low-latency, high-speed, reliable routing on the networks introduced in Chapter 3. Chapter 5 
considers fault identification and system reconfiguration. Chapter 6 develops suitable, high-speed 
signalling techniques compatible with the router-to-router communications required by the 
routing protocol. Finally, Chapter 7 looks at packaging technologies for practical, high-performance 
networks. Part III contains a brief series of case-studies from our experience designing and building 
reliable, low-latency networks. Chapter 8 reviews the RN1 routing component. Chapter 9 discusses 
RN1's successor, the METRO router series. Chapter 11 describes METRO-LINK, a network interface 
suitable for connecting a processing node into a METRO based network. Finally, Chapters 10 and 12 
discuss MBTA, an experimental multiprocessor which puts most of the technology described in 
Part II and the components detailed in Part III together in a complete multiprocessor system. 
Chapter 13 concludes by reviewing the techniques introduced in Part II and showing how they 
come together to achieve low-latency and fault-tolerant operation. 



2. Background 



This chapter provides background material to prepare the reader for the development in Parts II 
and III. Section 2.1 describes the fault model and multiprocessor model assumed throughout this 
document. Section 2.2 provides a brief review of standard scan-based testing practices. Sections 2.3 
and 2.5 point out the importance of low latency and fault tolerance to large-scale multiprocessor 
systems. Section 2.4 reviews the composition of network latency. Section 2.6 looks at the 
requirements for fault tolerance. Finally, Sections 2.7 and 2.8 introduce several other key issues in 
the practical design of interconnection networks. 

2.1 Models 

2.1.1 Fault Model 

Faults occurring in a network may be either static or dynamic and may be transient faults or 
permanent faults. While a permanent fault occurs and remains a fault, a transient fault may only 
persist for a short period of time. Transient faults which recur with notable frequency are termed 
intermittent. [SS92] indicate that transient and intermittent faults account for the vast majority of 
faults which occur in computer systems. For the purposes of this presentation, static faults are 
permanent or intermittent faults which have occurred at some point in the past and are known to 
the system as a whole. Dynamic faults are transient faults or any faults which the system has not 
yet detected. 

Throughout this work, we assume that faults manifest themselves as: 

1. Stuck-Values - a data or control line appears to be held exclusively high or low 

2. Random bit flips - a data or control line has some incorrect, but random value 

Faults may appear and disappear at any point in time. They may become permanent and remain 
in the system, they may be transient and disappear, or they may be intermittent and recurring. 
Stuck-value errors may take on an arbitrary, but constant, logic value. Bit flips are assumed to take 
on random values. Specifically, we are not assuming an adversarial fault model (e.g. [MR91]) in 
which faulty portions of the system are allowed to take on arbitrary erroneous values. 

These fault-manifestations are chosen to be consistent with fault expectations in digital hardware 
systems. Structural faults in the interconnect between components may give rise to floating or 
shorted nodes. With proper electrical design, floating i/o's can appear as stuck-values to internal 
logic. Shorted nodes will depend on the values present on the shorted nodes and may appear as 
random bit flips when the values differ. Clocking, timing, and noise problems which cause incorrect 
data to be sampled by a component will also appear as random bit errors. Opens and bridging faults 
within an IC may also leave nodes shorted or floating. For a good survey of physical faults and 
their manifestations see Chapter 2 in [SS92]. 

The manner in which we handle dynamic faults in this work relies on end-to-end checksums to 
make the likelihood that a corrupted message looks like a good message arbitrarily small. As long 
as faults produce random data, we can select a checksum which has the desired property. However, 
if we allow arbitrary, malicious intervention as in an adversarial fault model, the adversary could 
remove a corrupted message from the network and replace it with one which looks good or remove 
a good message from the network and fake an acknowledgment. In order to handle this stronger 
fault-model, one would have to replace our practice of guarding data with checksums with an 
end-to-end data encryption scheme. A properly chosen encryption scheme could make the chances 
that an adversary could fake any message sufficiently remote for any particular application. 
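
A minimal sketch of end-to-end checksum guarding follows. The text does not name a particular checksum, so CRC-32 from the Python standard library (`zlib.crc32`) is used here only as a stand-in, and the framing (a 4-byte checksum appended to the payload) is an assumption made for illustration.

```python
import zlib


def frame(payload):
    """Source side: append a 32-bit checksum computed over the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(message):
    """Destination side: recompute the checksum and compare with the received one."""
    payload, received = message[:-4], message[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received


if __name__ == "__main__":
    good = frame(b"data word")
    flipped = bytes([good[0] ^ 0x01]) + good[1:]   # a single bit flip in transit
    print(check(good), check(flipped))
```

Under the random-fault assumption, a corrupted message passes a k-bit check with probability of roughly 2^-k, so lengthening the checksum makes undetected corruption as unlikely as the application requires. This argument does not hold against an adversary, who can recompute the checksum over forged data.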

For the sake of the presentation here, we limit our concern to faults within the network itself. 
The processing nodes are presumed to function correctly, if at all. A processing node may cease to 
function, but it may not provide erroneous data to the network. All network transactions requested 
by the node are presumed to be intentional. The computational implications of losing access to an 
ongoing computation or the memory stored at a failing node are important but beyond the scope of 
this work. 

Without knowing the reliability design of the computational system as a whole, it is not clear 
whether a fault-tolerant network should be designed to optimize for harvest or yield. Yield is the 
term used to describe the likelihood that the system can be used to complete a given task. If we 
require that all nodes be fully connected to the network, then designing the network is a yield 
problem in which the network is only considered good when it provides full connectivity. In this 
case, we want to optimize for the highest yield at the fault levels of interest. Harvest Rate is 
the term used to refer to the fraction of total functional units which are usable in a system. If the 
computational model can cope with the node loss, then designing the network is a harvest problem 
in which we attempt to optimize for the most connectivity at any fault level. 
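
The two measures can be stated as a short sketch. It assumes each faulty network instance is summarized by one boolean per node saying whether that node is still connected; this representation and the function names are hypothetical.

```python
def harvest_rate(node_connected):
    """Fraction of nodes that remain usable in one faulty network instance."""
    return sum(node_connected) / len(node_connected)


def yield_rate(instances):
    """Fraction of network instances that retain full connectivity."""
    return sum(all(nodes) for nodes in instances) / len(instances)


if __name__ == "__main__":
    # Four hypothetical fault scenarios for a four-node machine.
    instances = [
        [True, True, True, True],
        [True, True, False, True],
        [True, True, True, True],
        [False, True, True, True],
    ]
    print(yield_rate(instances), [harvest_rate(i) for i in instances])
```

In this example half of the instances are unusable under the yield criterion, yet every instance still harvests at least three quarters of its nodes.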

2.1.2 Multiprocessor Model 

For the purpose of discussion, we assume a homogeneous, distributed-memory multiprocessor 
model as shown in Figure 2.1. Each node is composed of a processor, some memory, and a network 
interface. In a hardware-supported shared-memory machine, this network interface might be the 
cache-controller [LLG+91] [ACD+91]; in a message-passing machine, it would be the network 
message interface [Cor91] [Thi91]. Increasingly, the network interface may be tightly integrated 
with the processor [D+92] [NPA91]. We explicitly assume the network interface has multiple 
connections both into the network and out of the network. Multiple connections are necessary 
to avoid having a potential single point of failure at the connection between each node and the 
network. 

2.2 IEEE-1149.1-1990 TAP 

In Part II, we introduce extensions to standard, scan-based testing practices to make them 
suitable for use in large-scale systems. This section reviews the major points of the existing 
standard upon which we are building. 

The IEEE Standard Test-Access Port (TAP) [Com90] defines a serial test interface requiring 
four dedicated I/O pins on each component. The standard allows components to be daisy-chained 
so that a single test path can provide access to many or all components in a system. The standard 
provides facilities for external boundary-scan testing, internal component functional testing, and 
internal scan testing. Additionally, the TAP provides access to component-specific testing and 
configuration facilities. Figure 2.2 shows the basic architecture for an IEEE scan-based TAP. 

Figure 2.1: Multiprocessor Model 

Figure 2.2: Standard IEEE TAP and Scan Architecture 

In a system in which all components comply with the standard, boundary-scan testing allows 
complete structural testing. Using the serial scan path, every I/O pin in the system can be configured 
to drive a logic value or act as a receiver. Using the same serial scan path, the value of every receiver 
can be sampled and recovered. This mechanism allows the TAP to verify the complete connectivity 
of the components in the system. All connectivity faults (shorted wires, stuck drivers or receivers, 
and open circuits) can be identified in this manner [GM82] [Wag87]. 
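
The idea of driving values through the scan path and sampling the receivers can be sketched with a simple interconnect test. This is an illustration of the principle only, not the IEEE 1149.1 instruction sequence: the `observe` callback is a hypothetical stand-in for shifting a pattern out to the drivers and shifting the sampled receiver values back in.

```python
def interconnect_test(n_nets, observe):
    """Return the nets whose sampled value ever differs from the driven value."""
    # All-zeros and all-ones expose stuck values; walking ones expose shorts.
    patterns = [[0] * n_nets, [1] * n_nets]
    patterns += [[1 if j == i else 0 for j in range(n_nets)] for i in range(n_nets)]
    suspects = set()
    for driven in patterns:
        received = observe(driven)
        suspects |= {i for i in range(n_nets) if received[i] != driven[i]}
    return suspects


if __name__ == "__main__":
    # Hypothetical fault: net 2 is stuck low at its receiver.
    def stuck_low(driven):
        return [0 if i == 2 else v for i, v in enumerate(driven)]

    print(interconnect_test(4, stuck_low))
```

The pattern count grows linearly with the number of nets here; more compact pattern sets exist, but the walking pattern keeps the sketch easy to follow.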

The scan path allows data to be driven into a component independent of the values present on 
the component's external I/O pins. The resultant values generated by the component in response 
to the driven data can similarly be sampled and recovered via the serial scan path. This facility 
permits functional, in-circuit verification of any such component. 

The standard allows additional instructions which may function in a component-specific manner. 
These instructions provide uniform access to internal-component scan-paths. Such internal paths 
are commonly used to allow a small number of test-patterns to achieve high fault coverage in 
components with significant internal state. Other common additions are configuration registers and 
Built-In-Self-Test (BIST) facilities [KMZ79] [LeB84] [Lak86]. 

2.3 Effects of Latency 1 

For the sake of understanding the role of latency in multiprocessor communications, we consider 
a very simple model of parallel computation. To solve our problem we need to execute a total 
number of operations, c. Let us assume our problem is characterized by a constant amount of 
parallelism, p. During each clock cycle, we can perform p operations. Parallelism is limited 
because each set of p operations depends on the results of the previous p operations. After a set 
of operations complete, they must communicate their results with the processors which need those 
results for the next set of p operations. Let us assume that communicating between processors 
requires $l$ clock cycles of latency. 

If we executed our program on a multiprocessor with more than p nodes, it would take 
$T_{multiproc}$ cycles to solve the problem. 

$$T_{multiproc} = \frac{c \cdot (l+1)}{p} \tag{2.1}$$

At clock cycle 1, we can execute p operations on the nodes. We then require $l$ cycles to communicate 
the results. The next p operations can then be executed in cycle $l + 2$. Computation continues 
in this manner executing p operations every $(l + 1)$ cycles. Thus $\frac{p}{l+1}$ operations are executed, on 
average, each cycle giving us Equation 2.1. 
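
As a worked example of this model, the sketch below evaluates the execution time as c operations performed p at a time, with each step costing one compute cycle plus the communication latency (written `l` in the code). The numeric values are hypothetical.

```python
def t_multiproc(c, p, l):
    """Cycles to run c operations, p per step, with l cycles of latency per step."""
    return c * (l + 1) / p


def effective_parallelism(p, l):
    """Average number of operations completed per cycle."""
    return p / (l + 1)


if __name__ == "__main__":
    # 1000 operations with parallelism 10, for latencies of 0 and 9 cycles.
    for latency in (0, 9):
        print(latency, t_multiproc(1000, 10, latency), effective_parallelism(10, latency))
```

With nine cycles of latency the tenfold parallelism is cancelled entirely: the machine completes one operation per cycle on average, no better than a single processor with no communication.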

We see immediately that the exploitable parallelism is limited by the latency of communication. 
If our problem allows much more parallelism than we have nodes in our multiprocessor, we 
can hide the effects of latency by performing other operations in a set while waiting for the 
communication associated with the earlier operations to complete. However, if we wish to use 
large-scale multiprocessors to solve big problems, latency directly acts to limit the extent to which 
we can exploit parallel execution to solve our problem quickly. 

In most parallel programs, the number of operations which can be executed in parallel varies 
throughout the program's execution. Hence p is not a constant. Researchers have characterized 
this parallelism for particular programs and computational models using a parallelism profile which 
shows the number of operations which may be executed simultaneously at each time-step assuming 
an unbounded number of processors (e.g. [AI87]). The available parallelism will be a function of 
the compiler and run-time system in addition to being dependent on the problem being solved and 
the algorithm used to solve it. 

1 The basic argument presented here is drawn from an unpublished manuscript by Professor Michael Dertouzos. 

Communication latency is also, generally, not constant. Section 2.4 looks at the factors that 
affect latency in a multiprocessor network. 

Despite the fact that our model used above is overly simplistic, it does give us insight into 
the role which latency plays in parallel computing. When our algorithm, compiler, and run-time 
system can discover much more parallelism than we have processing elements to support, with 
good engineering we can hide some or all of the effects of latency. On the other hand, when we are 
unable to find such a surplus of parallelism, latency further derates the exploitable parallelism in a 
linear fashion. 

2.4 Latency Issues 

In this section, we consider in further detail many of the issues relevant to achieving low-latency 
communications. 

2.4.1 Network Latency 

Ignoring protocol overhead at the destination or receiving ends of a network, the latency in an 
interconnection network comes from four basic factors: 

1. Transit Latency ($T_t$): The amount of time the message spends traversing the interconnection 
media within the network 

2. Switching Latency ($T_s$): The amount of time the message spends being switched or routed 
by switching elements inside the network 

3. Transmission Time ($T_{transmit}$): The time required to transmit the entire contents of a message 
into or out of the network 

4. Contention Latency (7): The degradation in network latency due to resource contention in 
the network 

Transit latency is generally dictated by physics and geometry. Transit latency is the quotient of 
the physical distance and the rate of signal propagation. 

$$T_t = \frac{d}{v} \tag{2.2}$$

Basic physics limits the amount of time that it takes for a signal to traverse a given distance. Materials 
will affect the actual rate of signal propagation, but regardless of the material, the propagation speed 
will always be below the speed of light, $c \approx 3 \times 10^{10}$ cm/s. The rate of propagation is given by: 

$$v = \frac{1}{\sqrt{\mu \epsilon}} \tag{2.3}$$

For most materials $\mu \approx \mu_0$, where $\mu_0$ is the permeability of free space. Conventional printed-circuit 
boards (PCBs) have $\epsilon = \epsilon_r \epsilon_0$, where $\epsilon_r \approx 4$ and $\epsilon_0$ is the permittivity of free space; thus, 
$v \approx c/2$. High performance substrates have lower values for $\epsilon_r$. The physical geometry for the 
network determines the physical interconnection distances, d. Physical geometry is partially in the 
domain of packaging (Chapter 7), but is also determined by the network topology (Chapter 3). All 
networks are limited to exploiting, at most, three-dimensional space. Even in the best case, the 
total transit distance between two nodes in a network is at least limited by the physical distance 
between them in three-space. Additionally, since physical interconnection channels {e.g. wires, 
PCB traces, silicon) occupy physical space, the volume these channels consume within the network 
often affects the physical space into which the network and nodes may be packed. 
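
As a rough numerical illustration of Equations 2.2 and 2.3, the following Python sketch computes the propagation speed and the resulting transit latency for a material with a given relative dielectric constant. The function names and the 30 cm trace length are hypothetical choices made only for this example; the speed of light and the relative dielectric constant of about 4 for conventional PCBs are the values quoted in the text. 

```python
import math

C_CM_PER_S = 3e10  # approximate speed of light, cm/s (value quoted in the text)


def propagation_speed(eps_r, mu_r=1.0):
    """Propagation speed in cm/s for relative permittivity eps_r.

    mu_r defaults to 1 because most materials have a permeability
    close to that of free space.
    """
    return C_CM_PER_S / math.sqrt(mu_r * eps_r)


def transit_latency_ns(distance_cm, eps_r):
    """Transit latency T_t = d / v, returned in nanoseconds."""
    return distance_cm / propagation_speed(eps_r) * 1e9


# Conventional PCB (eps_r of about 4): signals travel at about half of c.
print(propagation_speed(4.0) / C_CM_PER_S)
# Hypothetical 30 cm trace on such a board: roughly 2 ns of transit latency.
print(transit_latency_ns(30.0, 4.0))
```

Lowering the dielectric constant is the only term here that a substrate choice can influence; the distance is set by topology and packaging. 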

For networks with uniform switching nodes, switching latency is the product of the number of 
switching stages between endpoints, $s_n$, and the latency of each switching node, $t_{nl}$. 

$$T_s = s_n \cdot t_{nl} \qquad (2.4)$$

The network topology dictates the number of switching stages. The latency of each switching node 
is the sum of the signal i/o latency, $t_{io}$, and the switching node functional latency, $t_{switch}$. 

$$t_{nl} = t_{io} + t_{switch} \qquad (2.5)$$

The signal i/o latency, or the amount of time required to move signals into and out-of the switching 
node, is generally determined by the signalling discipline and the technologies used for the switching 
node (Chapter 6). The switch functional latency accounts for the time required to arbitrate for an 
appropriate output channel and move message data from the input channel to the output channel. In 
addition to technology, the switch functional latency will depend on the complexity of the routing 
and arbitration schemes and the complexity of the switching function (Chapter 4). Larger switches 
generally require more complicated arbitration and switching, resulting in larger inherent switching 
latencies. 

The transmission time accounts for the amount of time required to move the entire message 
data into or out-of the network. In many networks, the amount of data transmitted in a message is 
larger than the width of a data channel. In these cases, the data is generally transmitted as a sequence 
of pieces, each limited to the width of the channel. Assuming we have a message of 
length $L$ to send over a channel $w$ bits wide which can accept new data every $t_c$ time units, we have 
the transmission time, $T_{transmit}$, given by: 

$$T_{transmit} = \left\lceil \frac{L}{w} \right\rceil \cdot t_c \qquad (2.6)$$

Here we see one of the places where low bandwidth has a detrimental effect on network latency. 
$T_{transmit}$ increases as the channel bandwidth decreases. 
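
The effect of channel width on transmission time can be sketched in a few lines of Python. This is only an illustration of Equation 2.6, assuming that a message whose length is not a multiple of the channel width still occupies a whole final transfer; the function name and the sample message sizes are hypothetical. 

```python
def transmission_time(length_bits, width_bits, t_c):
    """Time to move a message of length_bits over a channel width_bits wide.

    The channel accepts one width_bits-wide piece every t_c time units, so
    the message needs ceil(length_bits / width_bits) transfers.
    """
    transfers = -(-length_bits // width_bits)  # integer ceiling division
    return transfers * t_c


# Hypothetical 160-bit message, channel accepting new data every 10 ns.
print(transmission_time(160, 8, 10))   # 8-bit channel: 20 transfers
print(transmission_time(160, 16, 10))  # 16-bit channel: 10 transfers
print(transmission_time(20, 8, 10))    # partial final piece still costs a transfer
```

Doubling the channel width halves the transmission time for long messages, which is the bandwidth dependence noted above. 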

Contention latency arises when resource conflicts occur and a message must wait until the nec- 
essary resources are available before it can be delivered to its designated destination. Such conflicts 
result when the network has insufficient bandwidth or the network bandwidth is inefficiently used. 
In packet-switched networks, contention latency manifests itself in the form of queuing which must 
occur within switches when output channels are blocked. In circuit-switched networks, contention 
latency is incurred when multiple messages require the same channel(s) in the network and some 
messages must wait for others to complete. Contention latency is the effect which differentiates 
an architecture's theoretical minimum latency from its realized latency. The amount of contention 
latency is highly dependent on the manner in which an application utilizes the network. Contention 
latency is also affected by the routing protocol (Chapter 4) and network organization (Chapter 3). 
One can think of contention latency as a derating factor on the unloaded network latency. 

$$T_{unloaded} = T_s + T_t \qquad (2.7)$$

$$T_{net} = \gamma(\text{application}, \text{topology}) \cdot T_{unloaded} + T_{transmit} \qquad (2.8)$$

One of the easiest ways to see this derating effect is when an application requires more bandwidth 
between two sets of processors than the network topology provides. In such a case, the effective 
latency will be increased by a factor equal to the ratio of the desired application bandwidth to the 
available network bandwidth, e.g. if $A_{bw}$ is the bandwidth needed by an application, and $N_{bw}$ is 
the bandwidth provided by the network for the required communication, we have: 

$$\gamma = \frac{A_{bw}}{N_{bw}}$$

In practice, the derating factor is generally larger than a simple ratio due to the fact that the resource 
conflicts themselves may consume bandwidth. For example, on most local-area networks, when 
contention results in collisions, the time lost during the collision adds to the network latency as 
well as the time to finally transmit the message. 

The effects of contention latency make it clear why a bus is inefficient for multiprocessor 
operation. The bus provides a fixed bandwidth, $N_{bw}$. There is no switching latency and generally 
a small transit latency over the bus. However, as we add processors to the bus, the bandwidth 
potentially usable by the application, $A_{bw}$, generally increases while the network bandwidth stays 
fixed. This translates into a large contention derating factor, $\gamma$, and consequently high network 
latency. 
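
The bus argument can be made concrete with a small Python sketch of Equation 2.8 using the simple bandwidth-ratio derating factor. All numbers below (bus bandwidth, per-processor demand, transit and transmission times) are hypothetical, and flooring the derating factor at 1 when the network has spare bandwidth is an assumption of this sketch rather than something stated in the text. 

```python
def derating_factor(app_bw, net_bw):
    """Contention derating factor: ratio of demanded to available bandwidth.

    Assumption for this sketch: the factor is floored at 1, i.e. a network
    with spare bandwidth is treated as uncontended.
    """
    return max(1.0, app_bw / net_bw)


def network_latency(t_s, t_t, t_transmit, gamma):
    """T_net = gamma * (T_s + T_t) + T_transmit."""
    return gamma * (t_s + t_t) + t_transmit


BUS_BW = 100.0       # fixed bus bandwidth (hypothetical units)
PER_PROC_BW = 25.0   # bandwidth each processor would like to use (hypothetical)

for processors in (1, 4, 16, 64):
    gamma = derating_factor(processors * PER_PROC_BW, BUS_BW)
    # A bus has no switching latency; assume 5 time units of transit
    # latency and 20 time units of transmission time.
    print(processors, gamma, network_latency(0.0, 5.0, 20.0, gamma))
```

Once the aggregate demand exceeds the fixed bus bandwidth, the derating factor grows in proportion to the number of processors. 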

Unfortunately, it is hard to quantify the contention latency factor as cleanly as we can quantify 
other network latency factors. The bandwidth required between any pair of processors is highly 
dependent on the application, the computational model in use, and the run-time system. Further, it 
depends not just on the available bandwidth between a pair of processors, but between any sets of 
processors which may wish to communicate simultaneously. 

2.4.2 Locality 

Often physical and logical locality within a network can be exploited to minimize the average 
communication latency. In many networks, nodes are not equidistant. The transit latency and 
switching latency between a pair of nodes may vary greatly based on the choice of the pair of 
nodes. Logical distance is used to refer to the amount of switching required between two nodes 
($T_s$), and physical distance is used to refer to the transit latency ($T_t$) required between two nodes. Thus, 
two nodes which are closer, or more local, to each other logically and physically may communicate 
with lower latency than two nodes which are further apart. Additionally, when logically close nodes 
communicate they use less switching resources and hence contribute less to resource contention in 
the network. 

The extent to which locality can be exploited is highly dependent upon the application being run 
over the network. The exploitation of network locality to minimize the effective communication 
latency in a multiprocessor system is an active area of current research [KLS90] [ACD+91] [Wal92]. 
Exploiting network locality is of particular interest when designing scalable computer systems since 
the latency of the interconnect will necessarily increase with network size. Assuming the physical 
and logical composition of the network remains unchanged when the network is grown, for networks 
without locality, the physical distance between all nodes grows as the system grows due to spatial 
constraints. For networks with locality the physical distance between the farthest separated nodes 
grows. Additionally, as long as bounded-degree switches (Section 2.7.1) are used to construct the 
network, the logical distance between nodes increases as well. Locality exploitation is one hope 
for mitigating the effects of this increase in latency. 

It is necessary to keep the benefits due to locality in proper perspective with respect to the entire 
system. A small gain due to locality can often be dwarfed by the fixed overheads associated with 
communication over a multiprocessor network. Locality optimizations yield negligible rewards 
when the transmission latency benefit is small compared to the latency associated with launching and 
handling the message. Johnson demonstrated upper bounds on the benefits of locality exploitation 
using a simple mathematical model [Joh92]. For a specific system ([ACD+91]), he shows that 
even for machines as large as 1000 processors, the upper bound on the performance benefit due to 
locality exploitation is a factor of two. 

2.4.3 Node Handling Latency 

This document concentrates on designing the network for a high-performance multiprocessor. 
Nonetheless, it is worthwhile to point out that the effective latency seen by the processors is also 
dependent on the latency associated with getting messages from the computation into the network, 
out-of the network, and back into the computation. Network input latency, $T_p$, is the amount of time 
after a processor decides to issue a transaction over the network before the message can be launched 
into the network, assuming network contention does not prevent the transaction from entering the 
network. Similarly, network output latency, $T_w$, is the amount of time between the arrival of a 
complete message at the destination node and the time the processor may begin actually processing 
the message. If not implemented carefully, large network input and output latency can limit the 
extent to which low-latency networks can facilitate low-latency communication between nodes. 
Combining these effects with our network latency we have the total processor to processor message 
latency: 

$$T_{message} = T_p + T_{net} + T_w \qquad (2.9)$$

This document will not attempt to directly address how one minimizes node input and output 
latency. Node latencies such as these are highly dependent on the programming model, processor, 
controller, and memory system in use. [NPA92] and [D+92] describe processors which were 
designed to minimize these latencies. [E+92] and [CSS+91] describe a computational model 
intended to minimize these latencies. Here, we will devote some attention to assuring that the 
network itself does not impose limitations which require large node input and output latencies. 

2.5 Faults in Large Systems 

In this section we review a first-order model of system failure rates. We use this simple model 
to underscore the importance of fault tolerance in large-scale systems. 

For the sake of simplicity, let us begin by considering a simple discrete model of component 
failures. A single component fails with some probability, $P_c$, in time $T$. This gives us a failure rate: 

$$\lambda_c = \frac{P_c}{T} \qquad (2.10)$$

The probability that the component survives a period of time $T$ is then: 

$$P_{cs} = 1 - P_c \qquad (2.11)$$

If we have a system with $N$ components which fails if any of the individual components fails, then 
the system survives a period of time $T$ only if all components survive the period of time $T$. Thus: 

$$P_{ss} = P_{cs}^N = (1 - P_c)^N \qquad (2.12)$$

For any reasonable component and a small time period, $T$, $P_c \ll 1$. To first order, Equation 2.12 
can reasonably be approximated as: 

$$P_{ss} \approx 1 - N \cdot P_c \qquad (2.13)$$

This tells us that the probability that the system fails during time $T$ is simply: 

$$P_{sf} = N \cdot P_c \qquad (2.14)$$

which corresponds to a failure rate: 

$$\lambda_s = \frac{N \cdot P_c}{T} = N \cdot \lambda_c \qquad (2.15)$$

From Equation 2.15 we see that the failure rate increases linearly with the number of components 
in the system, to first order. 
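
The quality of the first-order approximation can be checked numerically. The Python sketch below compares the exact survival probability of Equation 2.12 with the linearized form of Equation 2.13; the function names and sample values are hypothetical. It also shows that the linearization is only meaningful while the product of the component count and the per-component failure probability stays well below one, which is why the time period must be taken small. 

```python
def survival_exact(p_c, n):
    """Probability that all n components survive: (1 - p_c) ** n."""
    return (1.0 - p_c) ** n


def survival_first_order(p_c, n):
    """First-order approximation: 1 - n * p_c."""
    return 1.0 - n * p_c


# n * p_c = 0.001: the two forms agree to better than one part in a million.
print(survival_exact(1e-6, 1000), survival_first_order(1e-6, 1000))

# n * p_c = 1: the linear form predicts zero survival probability, while the
# exact value is close to exp(-1), so the approximation no longer applies.
print(survival_exact(1e-6, 10**6), survival_first_order(1e-6, 10**6))
```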

Example A moderate complexity, modern component has a failure rate of ten failures per million 
hours ($\lambda_c \approx 10^{-5}$ hr$^{-1}$) (see [0D86] for estimating component failure rates). A million-component 
machine which depended on all million components working correctly would have: 

$$\lambda_s = N \cdot \lambda_c = 10^6 \times 10^{-5}\ \mathrm{hr}^{-1} = 10/\mathrm{hr} \qquad (2.16)$$

This gives the machine a Mean Time To Failure (MTTF) of 6 minutes. 

If we can relax the requirement that all components and interconnect function correctly in order 
for the system to be operational, we can improve the MTTF. If we can sustain k faults before the 
system is rendered inoperative, the MTTF will be longer. As long as $k \ll N$, we can assume a 
constant failure rate for components given by Equation 2.15. Assuming the faults are independent, 
the rate of occurrence of $k$ failures is: 

$$\lambda_k = \frac{\lambda_s}{k} \qquad (2.17)$$

That is, the MTTF increases linearly with the number of tolerated faults. Revisiting our example, 
if we design the system to sustain 1000 faults ($k = 1000$), or just 0.1% of the total components in 
the system, the MTTF increases by a factor of 1000 to 6000 minutes or 100 hours. 
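
The arithmetic of the running example can be reproduced with a short Python sketch. It assumes, as in the first-order model above, independent faults, a constant system failure rate, and a number of tolerated faults that is small compared to the number of components; the function and parameter names are hypothetical. 

```python
def system_failure_rate(n_components, component_rate_per_hr):
    """First-order system failure rate: n_components times the component rate."""
    return n_components * component_rate_per_hr


def mttf_hours(n_components, component_rate_per_hr, tolerated_faults=1):
    """Mean time until the tolerated_faults-th fault renders the system inoperative.

    tolerated_faults=1 is the case where every component must work.
    """
    rate = system_failure_rate(n_components, component_rate_per_hr)
    return tolerated_faults / rate


# One million components, each failing about 1e-5 times per hour.
print(mttf_hours(10**6, 1e-5) * 60)   # MTTF in minutes when no fault is tolerated
print(mttf_hours(10**6, 1e-5, 1000))  # MTTF in hours when 1000 faults are tolerated
```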

For sufficiently large systems, we cannot achieve an adequately low system failure rate by 
requiring that every component in the system function properly. Rather, we must design sufficient 
redundancy into our system to achieve the reliability desired. 

2.6 Fault Tolerance 

In order to achieve fault tolerance, we need the ability to detect when faults have occurred and 
the ability to handle faults which have occurred. Typically, one uses redundancy in some form to 
satisfy both of these needs. Redundant data transmitted along with the message can be used to 
identify when portions of a message are damaged. Parity bits and message checksums are common 
examples of redundant data used to identify data corruption. Once faults are detected, we rely on 
redundant network hardware to avoid faulty portions of the network. That is, there must be different 
resources which perform the same function as the faulty portion of the network which can be used 
in place of the faulty portion. We also need a mechanism for exploiting the redundancy. The 
network organization (Chapter 3) often provides the resource redundancy. The routing protocol 
(Chapter 4) provides the redundancy for fault detection and provides mechanisms for exploiting 
the redundancy in the network. 

Designing networks to perform well in the presence of faults is very similar to designing 
networks to perform well in the presence of contention. Faults in the network look much like 
contention. Faulty resources are not useful for effectively routing data. In this manner, they have 
the same effect as resources which are always in use. Faulty resources also cause additional traffic 
in the network since they may corrupt messages and hence require the messages to be retransmitted. 
Alternately, we can think of the faulty resources as migrating routing traffic which they would have 
handled to other resources in the network. These non-faulty resources now see more traffic as a 
result. Appropriate design can yield solutions which improve both the performance of the system 
in the face of faults and the performance of the system in the face of heavy traffic. 

2.7 Pragmatic Considerations 

We must also take into account several pragmatic issues associated with building any system. 
When building a system, such as a network, we are constrained by the economics of currently 
available technology, issues of design complexity, and fundamental physical constraints. 

2.7.1 Physical Constraints 

For instance, we have already observed that the speed of signal propagation is largely fixed by 
the speed of light and the dielectric constant of readily available materials. Materials with notably 
lower dielectric constants do exist, but the cost and reliability of these materials currently relegate their use 
to small, high-end systems. As technology improves, we can expect these or other materials with 
lower dielectric constants to be available at prices which make their use more worthwhile. 

One might consider using light itself to achieve the maximum transmission rate ($v = c$). In 
some situations, this makes the most sense. However, the fact that signals must be converted 
from propagating electrons to propagating photons and back again often defeats any potential 
gains. The latency associated with converting from electrons to photons and back is currently 
large. Even assuming 100% power efficiency, modern optical modulator/detectors have at least an 
order of magnitude more i/o latency, $t_{io}$, than purely electrical i/o pads (e.g. 40 ns versus 3 ns) 
at comparable power levels [LHM+89]. Since optical detection latency is inversely proportional 
to the incident power level, the optical conversion would require an order of magnitude greater 
power than the electrical pads to make the optical i/o latency comparable to electrical i/o latency. It 
only makes sense to make this optical conversion when the distance traversed is sufficiently large 
that the reduction in physical transit latency due to faster propagation is larger than the conversion 
latencies. 
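
A back-of-the-envelope estimate of where this break-even point lies can be sketched in Python. The sketch assumes that each link pays its i/o latency once, that the electrical signal propagates at half the speed of light (a relative dielectric constant of about 4), and that the optical signal propagates at the speed of light; the 40 ns and 3 ns figures are the example values quoted above. The function name and this simplified link model are assumptions made for illustration. 

```python
C_CM_PER_S = 3e10  # approximate speed of light, cm/s


def break_even_distance_cm(t_io_optical_s, t_io_electrical_s, v_electrical, v_optical):
    """Distance at which an optical link starts to beat an electrical one.

    Solves d / v_electrical + t_io_electrical = d / v_optical + t_io_optical for d.
    """
    extra_conversion = t_io_optical_s - t_io_electrical_s
    saved_per_cm = 1.0 / v_electrical - 1.0 / v_optical
    return extra_conversion / saved_per_cm


d = break_even_distance_cm(40e-9, 3e-9, C_CM_PER_S / 2, C_CM_PER_S)
print(d / 100.0)  # break-even distance in metres, roughly 11 m under these assumptions
```

Under these assumptions, optical links only pay off for runs of roughly ten metres or more, far longer than typical board-level interconnect. 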

Current VLSI technology limits the bonding of i/o pads to the periphery of the integrated circuit 
die. This forces the number of i/o channels into an integrated circuit (IC) to be proportional to the 
perimeter of the die. Due to external bonding requirements, i/o pads are shrinking more slowly than 
other IC features. Consequently, ICs have a fairly fixed, limited number of i/o pads and this number 
is not scaling at a rate comparable to that of the useful silicon area inside the die. Available 
technology, thus, limits the number of i/o channels into an IC and hence the size of the primitive 
switching elements we can build. 

We must always take account of the fact that wires and components consume space. The finite 
thickness of wires limits the physical compactness of our multiprocessor. The space between nodes 
and routers must be large enough to accommodate the wires necessary to provide interconnect. In 
some topologies, the growth rate of the machine is dictated by the growth rate for the interconnect 
as much as the number and size of components. Additionally, space must be provided for adequate 
component cooling and access for repair. 

2.7.2 Design Complexity 

Each different component in a system requires separate: 

• Engineering effort to design and verify 

• Non-recurring engineering (NRE) costs to produce 

• Testing to select good components and diagnose potentially faulty components 

• Shelf-space to stock the components 

Consequently, it is beneficial to minimize the number of different components used in constructing 
any system. 

2.8 Flexibility Concerns 

Just as engineering more types of components is costly in terms of development, NRE, and 
testing, designing a new network for each new application or specific machine is also costly. 
We look for solutions which provide a wide range of flexibility so they can easily be extended 
or re-parameterized to solve a variety of problems. When building a network for a large-scale 
multiprocessor, our desire for flexibility leads us to be concerned about the following: 

• How do we provide additional bandwidth for each node at a given level of semiconductor 
and packaging technology? 

• How do we get more/less fault tolerance for applications which place a higher/lower premium 
on tolerating faults? 

• How do we build larger/smaller machines? 

• How can we decrease latency, and at what cost? 

Part II 

Engineering Reliable, Low-Latency Networks 

3. Network Organization 



In this chapter we survey potential low-latency networks and identify a family of networks which 
is most suitable for use in large-scale, fault-tolerant multiprocessors given practical considerations. 
After determining the basic network structure, we examine the issues involved in optimizing a 
particular network for a given application. 

3.1 Low-Latency Networks 

3.1.1 Fully Connected Network 

From the standpoint of latency, the optimal network is a fully-connected network in which 
every processor has a direct connection to every other processor (see Figure 3.1). Here, there is no 
switching latency (i.e. $T_s = 0$). The problem with this network, of course, is that the processor 
node size grows linearly with the size of the system. This is not practical for several reasons. We 
cannot build very large networks with bounded pin-out components, and a different component size 
is needed for each different network size. Using techniques from [Tho80] and [LR86], we find the 
interwiring resources will grow as $\Theta(N^3)$. Wiring constraints alone require that the best packaging 
volume grows as $\Theta(N^3)$, making, in the best case, the wiring distances, $d$, grow as $\Theta(N)$. Such an 
organization is not very practical. 



3.1.2 Full Crossbar 

Next, we consider a full crossbar arrangement (see Figure 3.2). If we could build a large enough 
crossbar, we would only traverse one switching node between any source-destination pair. Unfortunately, 
our pin limitations (Section 2.7.1) will not allow us to build a single crossbar of arbitrary size. In 
practice, we would have to distribute the function across many different components as shown in 
Figure 3.3. This would incur $O(n)$ switching latency and require $O(n^2)$ such switches. 

Figure 3.1: Fully Connected Networks 

Figure 3.2: Full 16 × 16 Crossbar 

Figure 3.3: Distributed 16 × 16 Crossbar 



3.1.3 Hypercube 

We might consider building a hypercube network to exploit locality and distributed routing 
control. The switching latency is $\log_2(N)$ as we need traverse at most one switching link in 
each dimension of the hypercube. Unfortunately, to maintain this characteristic, the switching 
node degree grows as $\Theta(\log(N))$. Node size soon runs into our pin limitations (Section 2.7.1) 
and a different size node is needed for each size of the machine constructed. Additionally, when 
implemented in three-dimensional space, the interconnection requirements cause the machine 
volume to grow as $\Theta(N^{3/2})$. This result is also derivable from the techniques presented in [Tho80] 
and [LR86] by considering the number of wires which must cross through the middle of the machine 
in any decomposition. If we divide an $N$-processor machine in half, the number of wires crossing 
the bisecting plane will be $O(N)$. If we distribute these wires in the two-dimensional plane dividing 
the two halves, then the plane is $\Theta(\sqrt{N})$ wire widths wide in each dimension. Considering that we 
get the same effect if we divide the machine via an orthogonal plane which also bisects the machine, 
we see that the machine is $\Theta(\sqrt{N})$ long in each dimension and hence the volume is $\Theta(N^{3/2})$. From 
this we can see that the transit distance, $d$, will generally grow as $\Theta(\sqrt{N})$. 

Figure 3.4: Hypercube (a 16 processor hypercube; drawing by Frederic Chong) 

Figure 3.5: Mesh - k-ary-n-cube with k = 2 

Making some compromises for practicality on the basic hypercube structure, a number of 
derivative networks result. The next two sections cover two major classes, multistage networks 
and k-ary-n-cubes. 



3.1.4 k-ary-n-cube 

For k-ary-n-cubes, we fix the dimension ($k$) to avoid the switching node size growth problem 
associated with the pure hypercube. We still get the locality and distributed routing. The switching 
latency grows as $O(\sqrt[k]{N})$ since there are at most $\sqrt[k]{N}$ routers which must be traversed in each 
dimension. Many popular k-ary-n-cube networks in use today set $k = 2$ or $k = 3$ to build mesh 
(see Figure 3.5) or cube (see Figure 3.6) structures [Dal87]. For these networks, the distances 
between components can be made uniformly short such that the switching latency dominates the 
transit latency. When constrained to three-dimensional space, larger values of $k$ will tend to have 
transit latencies which scale as $\Omega(\sqrt[3]{N})$. Toroidal k-ary-n-cubes can be used to cut the worst case 
switching latency in each dimension in half and avoid hot-spot problems in simple k-ary-n-cubes 
(see Figure 3.7) [DS86]. 

Figure 3.6: Cube - k-ary-n-cube with k = 3 (a 27 processor cube network; drawing by Frederic Chong) 

Figure 3.7: Torus - k-ary-n-cube with k = 2 and Wrap-Around Torus Connections 

Figure 3.8: 16 × 16 Omega Network Constructed from 2 × 2 Crossbars 

3.1.5 Flat Multistage Networks 

A multistage network distributes each hypercube routing element spatially so that fixed-degree 
switches can be used for routing. Like the hypercube, routing can occur in a distributed manner 
requiring only $\log_r(N)$ stages between any pair of nodes in the network. Here $r$ is a constant 
known as the radix which denotes the number of distinct directions to which each routing switch 
can route. Unlike the hypercube and k-ary-n-cube, the multistage network does not provide any 
locality. The number of switches required by a multistage network grows as $O(N \log(N))$. The 
best-case packaging volume grows as $\Theta(N^{3/2})$ and the transit latency grows as $\Theta(\sqrt{N})$ like the 
hypercube [LR86]. 

Quite a variety of networks can be classified as multistage networks including: Butterfly net- 
works, Banyan networks, Bidelta networks [KS86], Benes networks, and Multibutterfly networks. 
Figures 3.8 through 3.11 show some popular multistage networks. Each stage in these networks 
routes by successively subdividing the set of possible destinations into a number of equivalence 
classes equal to the radix of the routing components. For example, consider a radix-2 network. 
When connections enter the network, any input can reach any destination. The first stage of routing 
components divides this class into two different equivalence classes based on desired destination. 
Each succeeding network stage further subdivides a previous stage's equivalence classes into two 
more equivalence classes. When there is a single destination in each equivalence class, the network 
has uniquely determined the desired destination and can connect to the destination endpoints. This 
successive subdivision can be easily seen in the network shown in Figure 3.9. 
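
The successive subdivision performed by the routing stages can be sketched in Python. The sketch assumes that the number of endpoints is a power of the radix and that each stage consumes one digit of the destination address, most significant digit first; which digit a given stage actually examines depends on the wiring of the particular network, so this ordering is an assumption. All function names are hypothetical. 

```python
def num_stages(n_endpoints, radix):
    """Number of radix-way stages needed to single out one of n_endpoints."""
    stages, reach = 0, 1
    while reach < n_endpoints:
        reach *= radix
        stages += 1
    return stages


def routing_digits(dest, n_endpoints, radix):
    """Direction selected at each stage, most significant digit first."""
    digits = []
    for _ in range(num_stages(n_endpoints, radix)):
        digits.append(dest % radix)  # peel off the least significant digit
        dest //= radix
    return digits[::-1]


def class_sizes(n_endpoints, radix):
    """Size of the destination equivalence class after each stage."""
    sizes = [n_endpoints]
    while sizes[-1] > 1:
        sizes.append(sizes[-1] // radix)
    return sizes


print(num_stages(16, 2))          # four stages for 16 endpoints at radix 2
print(routing_digits(11, 16, 2))  # destination 11 is 1011 in binary
print(class_sizes(16, 2))         # each stage halves the candidate set
print(routing_digits(11, 16, 4))  # the same destination at radix 4
```

Raising the radix reduces the number of stages a connection must traverse, at the cost of larger routing components. 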

3.1.6 Tree Based Networks 

Properly constructed, a tree-based, multistage network avoids the major liabilities associated 
with the standard multistage networks. Specifically, we consider fat-tree networks as described 
in [Lei85] and [GL85] and shown in Figure 1.2. The switching delay remains $O(\log(N))$ as 
with hypercubes and multistage networks. Routing may occur in a distributed fashion. Unlike the 
multistage networks described above, the tree-based networks do allow locality exploitation. When 
the bandwidth between successive stages of the tree is chosen appropriately, the tree structures can 
be arranged efficiently in three-dimensional space; switching and wiring resources grow as $\Theta(N)$ 
and transit latency will grow as $\Theta(\sqrt[3]{N})$. While a tree-based network may have less cross-machine 
bandwidth than a hypercube with the same number of nodes, the tree-based machine requires 
a factor of $O(\log(N))$ less interconnect hardware. As a result, if one were to compare machines of the same 
size, taking into account three-dimensional space restrictions, the tree machine provides at least as 
much bandwidth while supporting a factor of $O(\log(N))$ more nodes. Leiserson shows that properly sized fat 
trees can efficiently perform any communication performed by any other similarly sized network 
[Lei85]. 

Figure 3.9: 16 × 16 Bidelta Network 

Figure 3.10: Benes Network 

Figure 3.11: 16 × 16 Multibutterfly Network 



3.1.7 Express Cubes 

Express cubes [Dal91] are a hybrid between a tree structure and a k-ary-n-cube (see Figure 3.12). 
By placing interchange switches periodically in a k-ary-n-cube, the switching delay can 
be reduced from $O(\sqrt[k]{N})$ to $\Theta(\log(N))$. Done properly, the transit latency remains $\Theta(\sqrt[3]{N})$. If 
we allow several different kinds of switching elements in the network, the size of each switching 
element can be limited to a fixed size. 



3.1.8 Summary 

Table 3.1 summarizes the major characteristics of the networks reviewed here. Asymptotically, 
at least, we see that fat trees and express cubes have the slowest growing transit and switching 
latencies while maintaining the slowest resource growth. For a limited range of network sizes, 
flat multistage networks and k-ary-n-cubes may offer reasonable, or even superior, performance at 
reasonable hardware costs. 



3.2 Wire Length 

In this chapter, we have introduced many networks which have wires whose length is a function 
of the network size. We call a long wire any single run of wire between two switches which has a 
transit time in excess of the period at which we could otherwise clock data between the switches. If 
we required the data to traverse any such wires in a single clock cycle, we would have to increase 
the clock period to accommodate the longest wire in the system. The longest wires in many of 
these networks will be $\Omega(\sqrt[3]{N})$ due to spatial constraints in three dimensions. Requiring data to 
traverse these wires in a single clock cycle would require our clock period to increase comparably 
with network size. However, if we pipeline multiple bits on the long wires, we do not have to adjust 
the clock frequency to accommodate long wires. Our notion of transit latency as proportional to 
interconnection distance (Equation 2.2) will still hold. Instead of being a continuous equation as 
given, it becomes discretized in units of the clock period, $t_c$. 

$$T_t = \sum_{s} \left\lceil \frac{d_s}{v \cdot t_c} \right\rceil \cdot t_c \qquad (3.1)$$

Equation 3.1 explicitly breaks the total distance into segments ($d_s$) between each pair of switching 
elements in the path between the source and destination nodes to properly account for the effects of 
this discretization. Techniques for ensuring correct operation when bits are pipelined on the wires 
are detailed in Section 6.9. 

Figure 3.12: Express Cube Network - k = 2. Shown is a portion of an express mesh after [Dal91]. The 
components labelled with an I are interchange units which allow connections to be routed along 
express channels, thereby bypassing intermediate switching nodes. 

Table 3.1: Network Comparison 
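
The effect of discretizing transit latency can be illustrated with a short Python sketch. It assumes that the transit time of each wire segment between consecutive switching elements is rounded up to a whole number of clock periods; the segment lengths, propagation speed, and clock period below are hypothetical example values, as is the function name. 

```python
import math


def transit_cycles(segments_cm, v_cm_per_s, t_c_s):
    """Total transit latency in clock periods over a path of wire segments.

    Each segment's transit time is rounded up to whole clock periods.
    """
    cm_per_cycle = v_cm_per_s * t_c_s
    return sum(math.ceil(d / cm_per_cycle) for d in segments_cm)


segments = [10.0, 40.0, 100.0]  # wire lengths between successive switches, cm
v = 1.5e10                      # propagation speed, cm/s (half the speed of light)
t_c = 2e-9                      # clock period, s (signals cover about 30 cm per cycle)

cycles = transit_cycles(segments, v, t_c)
print(cycles)             # 1 + 2 + 4 clock periods
print(cycles * t_c)       # discretized transit latency, s
print(sum(segments) / v)  # continuous transit latency d / v, s
```

The discretized latency is never smaller than the continuous value, and the penalty is bounded by one clock period per segment. 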

3.3 Fault Tolerance 

In order to achieve fault tolerance in the network, we need multiple, distinct paths between any 
pair of nodes. The more distinct paths our network supports, the more robust the network will be 
to faults occurring in the network. In this section, we look at the multipath nature of the practical 
low-latency networks identified in the previous section. 

3.3.1 Indirect Routing 

If we allow indirect routing, all of the networks examined in this chapter have multiple paths. 
With indirect routing, a message may be routed from a source to a destination node by first routing 
the message through one or more intermediate nodes in the network. That is, when the source 
cannot reach the destination directly through the network, it is often possible for it to reach another 
processing node in the network which can, in turn, reach the desired destination node. If we allow 
arbitrary indirect hops through the network, any message can eventually be routed as long as the 
transitive closure of the non-faulty direct interconnect covers all the nodes in use in the network. 

While indirect routing will allow messages to eventually reach their destination, it does so at 
the cost of increased latency. Latency increases due to several effects. First, messages must cross 
the network multiple times. Second, additional overhead is generally required to allow indirection and 
process messages requiring re-routing. Finally, contention latency is increased since each indirected 
message consumes network bandwidth on each hop through the network. 
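
The reachability condition for indirect routing can be sketched in Python as a breadth-first search over the non-faulty direct interconnect. The representation of the network as a dictionary mapping each node to the set of nodes it can still reach directly, the function name, and the small example are all hypothetical; the search returns a route with the fewest network traversals, or nothing when the destination lies outside the transitive closure. 

```python
from collections import deque


def indirect_route(direct, source, dest):
    """Fewest-hop sequence of nodes from source to dest, or None if unreachable.

    direct maps each node to the set of nodes it can reach directly over
    non-faulty interconnect.
    """
    if source == dest:
        return [source]
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in sorted(direct.get(node, ())):
            if nxt in parent:
                continue
            parent[nxt] = node
            if nxt == dest:
                path = [dest]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None


# Hypothetical five-node system in which faults have removed most direct paths.
direct = {0: {1, 2}, 1: {0, 3}, 2: {0}, 3: {1}, 4: set()}
print(indirect_route(direct, 0, 3))  # node 0 reaches node 3 via node 1
print(indirect_route(direct, 2, 3))  # two intermediate nodes are needed
print(indirect_route(direct, 0, 4))  # node 4 is cut off entirely
```

Each additional intermediate node in the returned route corresponds to one more complete crossing of the network, which is the source of the latency increase described above. 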

3.3.2 k-ary-n-cubes and Express Cubes 

Direct, cube-based networks, like the k-ary-n-cube or the express cube, function by indirect 
routing. Each node is connected to $O(k)$ neighbors in a regular pattern and all routing is achieved 
by sending the message to a neighbor node which, generally, moves the message closer to the 
desired destination. At every hop, the message has a choice of paths to take to the destination, 
many of which would require the same transit and switching latency. The underlying network thus 
provides the requisite multiple paths. It is then up to the routing algorithm to efficiently utilize 
them. If our routing algorithm is omniscient about faults in the network, it can always find the 
shortest path between points in a faulty network. For many faults, the length of the shortest paths 
between close nodes will increase. However nodes which are further apart will see no increase in 
transit or switching latency. The more distant two nodes are from each other, the more minimum 
length paths there will be between them. 
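
To make the last observation concrete, the number of minimum-length paths can be counted directly in a simplified setting. The sketch below assumes a fault-free mesh without wraparound links, where a minimal route takes a fixed number of hops in each dimension and the hops may be interleaved in any order; the function name `minimal_paths` is ours. The count is then a multinomial coefficient. With wraparound links, as in a true k-ary-n-cube, there can be additional minimal paths when an offset is exactly half the ring size, so this should be read as an illustration rather than an exact count for every cube network.

```python
from math import factorial

def minimal_paths(offsets):
    """Count minimum-length paths between two mesh nodes.

    offsets[i] is the (non-negative) number of hops separating the two
    nodes in dimension i. The result is (sum offsets)! / prod(offsets[i]!).
    """
    count = factorial(sum(offsets))
    for delta in offsets:
        count //= factorial(delta)   # exact: each partial quotient is an integer
    return count

for offsets in [(1, 0), (1, 1), (2, 2), (3, 3, 3)]:
    print(offsets, minimal_paths(offsets))
```

Neighboring nodes, with offsets (1, 0), have a single minimal path, so any fault on it lengthens the route, whereas distant nodes have many alternatives of equal length.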

3.3.3 Multiple Networks 

A simple technique for adding fault tolerance to a network, which works for all kinds 
of networks, is to replicate a base network. We give each node a connection to each of the 
networks. As long as there is a non-faulty path on some network between any pair of nodes which 
must communicate, normal communication may occur with no degradation in switching or transit 
latency. The originating node need only choose which network to use for each message it needs to 
deliver. Additionally, the existence of multiple networks increases the bandwidth available in the 
network and hence can reduce contention latency if utilized efficiently. Unfortunately, the gain in 
fault-tolerance is small compared to the costs. Each additional path through the network requires 
that we construct a complete copy of the original network. Multiple, multistage style networks are 
used in the telecommunications field to minimize contention and increase available bandwidth over 
single-path networks [Hui90]. Figure 3.13 shows a 2-replicated bidelta network. 

Two four-stage networks connecting 16 endpoints are attached together at the endpoints. 
Each component is a 2 x 2, dilation-1 crossbar. 

Figure 3.13: Replicated Multistage Network 

Replicated networks do have one advantage over pure indirect routing schemes, including most 
cube style networks. With multiple networks, each node has multiple connections both to and 
from the network. As noted in Section 2.1.2, multiple network I/O connections are key to avoiding 
a single point of failure which may sever a node completely from the interconnection network. 

3.3.4 Extra-Stage, Multistage Networks 

When using multistage interconnection networks, one can construct extra-stage networks with 
more switching stages than are actually required to uniquely specify a destination ([LP83], [CYH84], 
et al.; see Figures 3.10 and 3.14). The set of routing specifications that reach the same physical 
destination defines a class of equivalent paths. So long as one path of each such class remains intact 
in a faulty extra-stage network, any endpoint will be able to successfully route to its destination. The 
extra stages in these schemes result in larger switching and transit latencies than the corresponding 
baseline network, even in the absence of faults. 

If extra stages are added, but the single connection into and out of each node is retained, extra-stage 
networks retain a single point of failure where the nodes connect to the network. To eliminate 
this problem, the extra-stage network should be constructed such that multiple network endpoints 
can be assigned to each node of the network. 

Figure 3.14: Extra Stage Network 

3.3.5 Interwired, Multipath, Multistage Networks 

Multibutterfly style, multipath networks are multistage networks which use dilated crossbar 
routing components. In addition to being characterized by the radix of the switching element, each 
dilated crossbar router is characterized by its dilation, d. The dilation is the number of logically 
equivalent outputs in each distinct direction. With a dilation greater than one, redundant routing is 
provided in each routing direction. Figure 3.11 shows an example of such a network. Figure 3.15 
shows some configurations for the dilated routing elements used in Figure 3.11. 

This class of multipath networks has a large number of distinct paths between each pair of 
nodes. The number of different switches in a stage which can be used to route between any pair of 
routers increases toward the center of the network. Up to the center of the network, the number of 
routers in any path grows by a factor of the dilation with each successive stage. Past the center of 
the network, the sorting function performed by the network limits the number of routers in the path 
to the desired destination. For those later stages, all routers which are in the path to the destination 
are candidates for use in routing any connection. 

For a given number of node connections, the multibutterfly style networks generally have more 
paths than the comparable replicated network. Consider a k-replicated network. A multipath 
network can be constructed from the k-replicated network by taking the k routers which occupy the 
same location in each of the k copies of the base network and creating one dilated router out of them with 
dilation d = k. This will give us a multibutterfly style network. Note that in the replicated network, 
we were only able to choose which resources to use when the message entered the network. In the 
multibutterfly network we have the option of switching between networks at each routing stage. 
Thus, there are many more paths through the multibutterfly networks. The fine details of how one 
wires these redundant paths are discussed in Section 3.5. 

Figure 3.15: 4 x 2 Crossbar with a dilation of 2 

By constructing fat-tree networks using dilated crossbar routers, it is possible to build multipath, 
fat-tree networks which exhibit the same basic properties. The tree networks will need multiple 
connections into and out-of the network to avoid single points of failure. Connections made through 
higher tree-levels have more paths between the source and the destination as they traverse more 
dilated routers in the network. 

3.4 Robust Networks for Low-Latency Communications 

Given our need for fault tolerance and low latency, the classes of networks which are most 
attractive are express cubes and multipath, fat-tree networks. For smaller networks, k-ary-n-cubes 
and flat multipath, multistage networks are also worth considering. Because of the acyclic nature 
of multistage routing networks, it is easier to devise robust and efficient routing schemes for this 
class of networks. Consequently, we will focus on multistage networks for the remainder of this 
document. 

3.5 Network Design 

This section discusses many of the issues relevant to designing a high-performance, robust, 
multipath, multistage routing network. The space of possible multipath networks is quite large, and 
some of the decisions made when selecting a particular network can make a significant difference 
in the fault tolerance and performance of the network. In addition to the basic parameter selection, 
the detailed network wiring scheme can have a notable effect on the performance of the resulting 
network. Many of the wiring issues are easier to describe and understand using small, flat, multipath, 
multistage interconnection networks. As a result, the examples and development which follow are 
given in terms of this class of networks. Nonetheless, the same design principles apply when 
developing multipath, fat-tree networks. 

    N     total number of nodes on the network 
    ni    input ports from each node to the network 
    no    output ports from each node to the network 
    i     input ports per router 
    o     output ports per router 
    r     router radix 
    d     router dilation 
    w     channel width 

Table 3.2: Network Construction Parameters 

3.5.1 Parameters in Network Construction 

Table 3.2 summarizes several parameters which will be used in this section when characterizing 
a network. Radix and dilation were introduced in Sections 3.1.5 and 3.3.5. ni and no quantify the 
number of connections between each node and the network; i and o are the number of connections 
into and out of each router. Generally, i = o = r · d. Since the number of inputs and the number of 
outputs on the routing components are the same, we say the routers are square. When we use square 
routers, the aggregate bandwidth between stages in flat, multistage networks remains constant. 

3.5.2 Endpoints 

The network endpoints are the weakest link in the network. If we are designing a network with 
a yield model in mind, in the worst case, we can sustain only min(ni, no) faults. If we are designing 
a network with a harvest model in mind, in the worst case each min(ni, no) faults will remove an 
additional node from the operational set. 

Once ni and no are chosen, we must also ensure that these connections are utilized effectively. 
In particular, to maximize robustness, each link must connect to a distinct routing component in the 
network. Note, for instance, in the network shown in Figure 3.11, that dilation-1 routers are used in 
the final stage of the network. These dilation-1 routers are used to achieve maximal fault tolerance 
by ensuring that the maximum number, no = 2, of distinct routers provide output connections from 
the network to each node. Figure 3.16 shows another alternative for using dilation-1 routers in the 
final stage. Rather than using d times as many routers with dilation-1 and the base radix unchanged, 
the network in Figure 3.16 uses routers which increase the radix by a factor equal to the dilation 
(i.e. r_final_stage = o = r · d). 

Figure 3.16: 16 x 16 Multibutterfly Network with Radix-4 Routers in Final Stage 

3.5.3 Internal Wiring 

Inside a multipath network, we have considerable freedom as to how we wire the multiple 
paths between stages. As described in Section 3.1.5, multistage networks operate by successively 
subdividing the set of potential destinations at each stage. All inputs to routing components in 
the same equivalence class at some intermediate network stage are logically equivalent since the 
same set of destinations can be reached by routing through those components. If we exercise this 
freedom judiciously, we can maximize the fault-tolerance and minimize the congestion within the 
network, and hence minimize the effects of congestion latency. 

Path Expansion 

A simple heuristic for achieving a high degree of fault tolerance is to wire the network to 
maximize the path expansion within the network. That is, we want to select a wiring which allows 
the connection between any two endpoints to traverse the maximum number of distinct routing 
components in each stage. Maximizing path expansion improves fault-tolerance by maximizing 
the redundancy available at each stage of the network. 

Let S be the total number of routing stages in the network. The number of paths between a 
single source-destination pair expands from the source into the network at the rate of dilation, d. 
Thus, we have p_in(s), the number of paths to stage s, given by Equation 3.2. 

\[ p_{in}(s) = n_i \times d^{(s-1)} \tag{3.2} \]

After some stage in the network, the paths will have to diminish in order to connect to the proper 
destination. Looking backward from the destination node, we see that the paths must grow as the 
network radix r. This constraint is expressed as follows: 

\[ p_{out}(s) = n_o \times r^{((S+1)-s)} \tag{3.3} \]

    s 
    p(s) 

Table 3.3: Connections into Each Stage 

These two expansions must, of course, meet at some point inside the network. This occurs when 
p_in and p_out are equal. Let us call this turning point stage s'. s' can be determined as follows: 

\begin{align*}
p_{out}(s') &= p_{in}(s') \\
n_i \times d^{(s'-1)} &= n_o \times r^{((S+1)-s')} \\
s' &= \frac{(S+1) \cdot \ln(r) + \ln(n_o) + \ln(d) - \ln(n_i)}{\ln(d) + \ln(r)} \\
   &= \frac{(S+1) \cdot \ln(r) + \ln\left(\frac{n_o \cdot d}{n_i}\right)}{\ln(d \cdot r)} \tag{3.4}
\end{align*}


Once Equation 3.4 is solved for s', we can quantify the number of connections into each stage of 
the network by Equation 3.5. 



\[
p(s) = \begin{cases}
n_i \times d^{(s-1)} & s < s' \\
\min\left(n_i \cdot d^{(s-1)},\; n_o \cdot r^{((S+1)-s)}\right) & s = s' \\
n_o \times r^{((S+1)-s)} & s > s'
\end{cases} \tag{3.5}
\]


Note that Equation 3.5 expresses the maximum achievable number of paths between stages for a 
single source-destination pair. This is effectively an upper bound on the path expansion in any 
dilated multipath network. The total number of distinct paths between each source and destination 
simply grows as Equation 3.2 and is thus given by Equation 3.6. 



\[ p_{total}(S) = n_i \times d^{(S-1)} \tag{3.6} \]

For example, consider the network in Figure 3.11 (ni = no = r = d = 2, S = 4). Solving 
Equation 3.4 for s', we find s' = 3. The number of connections into each stage can then be 
calculated as shown in Table 3.3. The total number of paths is simply 2 x 2^3 = 16. Noting 
Figure 3.11, we see it does achieve this maximum path expansion for the highlighted path; the 
paths between all other source and destination pairs in Figure 3.11 also achieve this path expansion. 
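
The arithmetic in this example is easy to check mechanically. The following sketch evaluates Equations 3.2 through 3.6 for the parameters of Figure 3.11; the function names are ours, and the per-stage count is computed as the minimum of the inward and outward expansions, which agrees with the three cases of Equation 3.5 because p_in only grows with s while p_out only shrinks.

```python
import math

def turning_stage(ni, no, r, d, S):
    """Stage s' at which p_in and p_out meet (Equation 3.4)."""
    return ((S + 1) * math.log(r) + math.log(no * d / ni)) / math.log(d * r)

def connections_into_stage(s, ni, no, r, d, S):
    """Upper bound on connections into stage s for one source-destination pair."""
    p_in = ni * d ** (s - 1)            # Equation 3.2
    p_out = no * r ** ((S + 1) - s)     # Equation 3.3
    return min(p_in, p_out)             # Equation 3.5

def total_paths(ni, d, S):
    """Total distinct paths between one source and one destination (Equation 3.6)."""
    return ni * d ** (S - 1)

ni = no = r = d = 2
S = 4
print("s' =", turning_stage(ni, no, r, d, S))
print("p(s) =", [connections_into_stage(s, ni, no, r, d, S) for s in range(1, S + 1)])
print("total paths =", total_paths(ni, d, S))
```

For these parameters the sketch gives s' = 3, per-stage connection counts of 2, 4, 8 and 4 for stages 1 through 4, and 16 total paths, matching the figures quoted in the text.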

α-β Expansion 

Unfortunately, path expansion can be a naive metric when optimizing the aggregate fault 
tolerance and performance of a network. Path expansion looks at a single source-destination pair 
and tries to maximize the number of paths between them. If we only considered path expansion in 
selecting a network design, many nodes could share the same sets of routers and connections in their 
paths through the network. This sharing would lead to a higher degree of contention. Additionally, 
when faults accumulate in the network, a larger number of nodes are generally isolated from the 
rest of the network at once. Consider, for instance, the two first-stage network wirings shown 
in Figures 3.17 and 3.18. Both wirings are arranged such that each processor connects to two 
distinct routers in the first stage of routing. However, the wiring shown in Figure 3.17 has four 
processors which share a pair of routers, whereas any group of four processors in the wiring shown 
in Figure 3.18 is connected to five routers in the first stage. As a result, there will generally be less 
contention for connections through the first stage of routers in the latter wiring than in the former. 

Figure 3.17: Left: Non-expansive Wiring of Processors to First Stage Routing Elements 
Figure 3.18: Right: Expansive Wiring of Processors to First Stage Routing Elements 

Leighton and Maggs introduced α-β expansion to formalize the desirable expansion properties 
as they pertain to groups of nodes which may wish to communicate simultaneously [LM89]. 
Informally, α-β expansion is a metric of the degree to which any subset of components in one stage 
will fan out into the next stage. More formally, we say a stage has α-β expansion (α, β) if any 
subset of α components from one stage must connect to at least α x β components in the next stage. 
β is thus an expansion factor which is guaranteed for any set of size α. Networks with favorable 
α-β expansion are networks for which the α-β expansion property holds with higher β for each 
value of α. The more favorable the α-β expansion, the more messages can be simultaneously 
routed between any sets of communicating processors, and hence the lower the contention latency. 
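
For small stages, the guaranteed expansion factor can be measured by brute force. The sketch below is illustrative only: it enumerates every subset of α components and records the smallest number of next-stage components reached, so its cost grows combinatorially and it is practical only for toy examples. The two wirings are hypothetical stand-ins in the spirit of Figures 3.17 and 3.18, not the wirings actually drawn there, and the function names are ours.

```python
from itertools import combinations

def min_fanout(wiring, alpha):
    """Smallest number of next-stage components reached by any group of
    alpha components. wiring[c] is the set of components c connects to."""
    return min(
        len(set().union(*(wiring[c] for c in group)))
        for group in combinations(wiring, alpha)
    )

def expansion_factor(wiring, alpha):
    """Largest beta such that every alpha-subset reaches alpha * beta components."""
    return min_fanout(wiring, alpha) / alpha

# Hypothetical non-expansive wiring: each block of four processors shares two routers.
shared = {p: {2 * (p // 4), 2 * (p // 4) + 1} for p in range(8)}
# Hypothetical expansive wiring: processor p connects to routers p and p + 1 (mod 8).
ring = {p: {p, (p + 1) % 8} for p in range(8)}

print(min_fanout(shared, 4), expansion_factor(shared, 4))
print(min_fanout(ring, 4), expansion_factor(ring, 4))
```

In the shared wiring a group of four processors can be confined to two routers (β = 0.5 at α = 4), while in the ring wiring every group of four reaches at least five routers (β = 1.25), which is the kind of difference the definition is meant to capture.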

Networks Optimized for Yield 

If we cannot tolerate node loss, and hence wish to optimize the fault-tolerance of the network 
as a yield problem, then it makes sense to focus on achieving the maximal path expansion first, 
then achieving as large a degree of α-β expansion as possible. Unfortunately, there is presently no 
known algorithm for achieving a maximum amount of α-β expansion, so the techniques presented 
here are heuristic in nature. 

To achieve maximum path expansion, we connect the network with the algorithm listed in 
Figure 3.19 [CED92]. The paths from any input to any output may fan out by no more than a factor 
of d, the dilation of the routers, at each stage. This fanout may also become no larger than the size of 
the routing equivalence classes at that stage. The routine groupsz returns the maximum fanout size 
allowed by both of these factors. Each stage is partitioned into fanout classes of this size, which 
are then used to calculate network wiring. The maximum path fanout described in Equation 3.5 is 
achieved by this algorithm for all pairs of components. 

As introduced above, the last stage is composed of dilation-1 routers to increase fault tolerance. 
Figure 3.20 shows a deterministically-interwired network composed of radix-2 routers. 

Networks Optimized for Harvest 

To achieve a high harvest rate and maximize performance, we want to wire networks with a high 
degree of α-β expansion. As introduced above, there are no known deterministic algorithms for 
achieving an optimal expansion. In practice, randomized wiring schemes produce higher expansion 
than any known deterministic methods. [Kah91] presents some of the most recent work on the 
deterministic construction of expansion graphs. [Upf89] and [LM89] show that randomly wired 
multibutterflies have good expansion properties. The high expansion generally means there will 
be less congestion in the network. Additionally, Leighton and Maggs show that after k faults have 
occurred on an N-node machine, it is always possible to harvest N - O(k) nodes [LM89]. 

As introduced in Section 3.1.5, multistage networks operate by successively subdividing the set 
of potential destinations at each stage. All the inputs to routing components in the same equivalence 
class at some intermediate stage in the network are logically equivalent. After the routing structure 
determines which set of outputs in one stage must be connected to which set of inputs in the 
following stage, we randomly assign individual input-output pairs within the corresponding sets. 
Figure 3.21 shows the core of an algorithm for randomly wiring a multibutterfly. The algorithm was 
first introduced in [CED92] and is based on the wiring scheme described in [LM89]. In practice, 






▷ Returns the next-stage router to which to wire for maximum path expansion 
wire_to_port(n, d_p, s) 
    ▷ n = router number, d_p = dilated port number, s = router stage 
    outgrpsz ← groupsz(s) 
    ingrpsz ← groupsz(s + 1) 
    ▷ offset to beginning of fanout class 
    eq_start ← ingrpsz × ⌊n / (r × d × outgrpsz)⌋ 
    ▷ offset to specific chip within fanout class 
    eq_router ← (n × d + d_p) mod ingrpsz 
    return(eq_start + eq_router) 

▷ Calculates size of fan-out class 
groupsz(s) 
    expansion ← ni × d^(s+1)           ▷ maximum fanout due to dilation 
    eq_class ← no × r^((S+1)-s)        ▷ equivalence class size 
    return(min(expansion, eq_class)) 

This algorithm generates a network designed to maximize path expansion. Each endpoint 
will have the maximum number of redundant paths possible through this type of network 
(boundary cases omitted for clarity). 

Figure 3.19: Pseudo-code for Deterministic Interwiring 




Figure 3.20: 16 X 16 Path Expansion Multibutterfly Network 

▷ in_set contains all the input ports of a single equivalence class in the next stage. 
▷ connections is an array matching in_ports and out_ports, initially empty. 
▷ out_ports_list lists the output ports of a single equivalence class in the 
▷ current stage. 
wire_eq_class(in_set, connections, out_ports_list) 
    foreach out_port in out_ports_list 
        in_port ← choose and remove a random input port from in_set 
        while (connected(router#(in_port), router#(out_port), connections)) 
            put in_port back in in_set 
            in_port ← choose and remove a random input port from in_set 
        connect(in_port, out_port, connections) 
    return(connections) 

connected(in_router, out_router, connections_array) 
    if in_router is already connected to out_router 
        return(true) 
    else return(false) 

This algorithm randomly interwires an equivalence class. To interwire a whole stage, the 
algorithm is repeated for each class (boundary cases omitted for clarity). 

Figure 3.21: Pseudo-code for Random Interwiring 

one would generate many such networks, compare their performance as described in Sections 3.5.4 
and 3.5.5, and pick the best one. Experience indicates that most such networks perform equivalently. 
The testing, however, assures that one avoids the unlikely, but possible, case in which a network 
with poor expansion was generated. Figure 3.22 shows a network constructed with this algorithm. 
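
The random interwiring of a single equivalence class (Figure 3.21) can be sketched in Python as follows. This is a minimal illustration under several assumptions of ours: ports are represented as hypothetical (router, port index) tuples, the result is returned as a dictionary rather than the connections array of the figure, and a valid input port is drawn uniformly from the remaining candidates, which yields the same distribution as the figure's draw, test and put-back loop. The figure omits boundary cases; here, if the remaining input ports all belong to routers already connected to the current output router, the attempt is abandoned and the class is rewired from scratch.

```python
import random

def wire_eq_class(in_ports, out_ports, rng=None, max_restarts=1000):
    """Randomly pair every output port with a distinct input port so that
    no output router is wired to the same input router twice.

    Ports are (router_id, port_index) tuples. Returns {out_port: in_port}.
    """
    rng = rng or random.Random()
    if len(in_ports) < len(out_ports):
        raise ValueError("fewer input ports than output ports")
    for _ in range(max_restarts):
        free = list(in_ports)       # input ports not yet used
        linked = set()              # (out_router, in_router) pairs already wired
        connections = {}
        for out_port in out_ports:
            candidates = [p for p in free if (out_port[0], p[0]) not in linked]
            if not candidates:      # dead end: restart this equivalence class
                break
            in_port = rng.choice(candidates)
            free.remove(in_port)
            linked.add((out_port[0], in_port[0]))
            connections[out_port] = in_port
        else:
            return connections      # every output port was wired
    raise RuntimeError("no valid wiring found; the constraints may be infeasible")

# Hypothetical class: four routers on each side, two ports per router.
outs = [(("out", r), p) for r in range(4) for p in range(2)]
ins = [(("in", r), p) for r in range(4) for p in range(2)]
wiring = wire_eq_class(ins, outs, rng=random.Random(1))
```

Generating many candidate networks, as suggested above, then amounts to calling this routine for every equivalence class of every stage with different random seeds and keeping the network that evaluates best.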

Hybrid Network Compromise 

Chong observed in [CK92] that one can achieve maximum path expansion while introducing 
some randomized expansion to minimize congestion. The result is a network which is a hybrid 
between the two described above. The basic strategy used in wiring such randomized, maximal-fanout 
networks is to further subdivide each routing equivalence class into fanout classes. Instead of 
randomly wiring from all outputs destined for a given equivalence class to the inputs on all routers 
in that equivalence class in the subsequent stage, the dilated outputs from each router are each 
sent to different fanout classes within the appropriate routing equivalence class (see Figure 3.23). 
Figure 3.24 sketches the algorithm used for wiring up these networks. Figure 3.25 shows an 
example of such a network. 

A randomly-interwired, four-stage network connecting 16 endpoints. Each component in 
the first three stages is a 4 x 2, dilation-2 crossbar. To prevent any single component from 
being in an endpoint's critical path, the last stage is composed of 2 x 2, dilation-1 crossbars. 

Figure 3.22: Randomly-interwired Network 

3.5.4 Network Yield Evaluation 
Yield 

As a simple metric for evaluating the yield characteristics of these multipath networks, we 
consider the probability that a network remains completely connected given a certain number of 
randomly chosen router faults. These Monte Carlo experiments model only complete router faults 
to show the relative fault-tolerant characteristics of these networks while containing the size of the 
fault-space which must be explored. 

The experiment proceeds by placing one randomly chosen fault at a time until the network 
becomes incomplete. The basic process is repeated on the same network for enough trials to 
achieve statistically significant results. Results are tabulated to approximate the probability of 
network completeness for each fault level. We also derive the expected number of faults each 
network can tolerate. 
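
The experimental procedure just described can be sketched as a small Monte Carlo driver. The sketch is ours and makes two simplifying assumptions: the network is abstracted behind a caller-supplied predicate `is_complete(faulty)` that reports whether every endpoint can still reach every other endpoint, and completeness is monotone, meaning that adding a fault never restores connectivity. The toy predicate at the end is purely hypothetical and stands in for a real connectivity check on a wired network.

```python
import random

def faults_tolerated(routers, is_complete, rng):
    """Place randomly chosen router faults one at a time and return how many
    faults the network absorbed before it first became incomplete."""
    order = list(routers)
    rng.shuffle(order)
    faulty = set()
    for router in order:
        faulty.add(router)
        if not is_complete(faulty):
            return len(faulty) - 1
    return len(faulty)

def estimate_yield(routers, is_complete, trials=1000, seed=0):
    """Return (expected faults tolerated, completeness), where completeness[f]
    estimates the probability that the network is complete with f faults."""
    rng = random.Random(seed)
    routers = list(routers)
    samples = [faults_tolerated(routers, is_complete, rng) for _ in range(trials)]
    expected = sum(samples) / trials
    completeness = [
        sum(1 for t in samples if t >= f) / trials
        for f in range(len(routers) + 1)
    ]
    return expected, completeness

# Toy stand-in: a 10-router "network" that fails as soon as it has three faults.
expected, completeness = estimate_yield(range(10), lambda faulty: len(faulty) < 3,
                                        trials=200)
```

Because faults are placed incrementally, a trial that tolerated t faults contributes to the completeness estimate at every fault level up to t, so a single pass over the trials yields both the curve and the expected value.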

Because the routing components in the final stage of our multipath networks are half the size 
of routers in the previous stages, we assign two such routers to one physical component package 
and label both routers faulty if the physical component is chosen to be faulty. Furthermore, the 
two routers are assigned so that removing any such pair will not cut off an endpoint. We make 
this assignment so that fault increments will be of constant hardware size. This assignment also 
simulates how the pair of 4 x 4, dilation-1 routers in an RN1 routing component (see Chapter 8) 
may be assigned. 

We generated three-stage and four-stage networks for each of the types of networks described 
above, connecting 64 and 256 endpoint nodes, respectively. Each endpoint has two connections 
to and from the network (ni = no = 2) to provide for the minimal amount of redundancy necessary 
to achieve fault tolerance. Every network uses radix-4 routers of dilation-2 and dilation-1 and 
hence could be implemented using the RN1 component. All the networks with a given number of 
stages contain the same number of components. Differences in fault tolerance and performance 
among these networks are therefore attributable solely to network wiring. 

The above figure shows how to achieve maximal fanout while avoiding regularity. The 
routers shown are radix-2 and dilation-2. At stage s, we divide each routing equivalence 
class into ni · d^(s-1) fanout classes until each fanout class contains a single router. Random 
wirings are chosen between appropriate fanout classes to form fanout trees. The disjoint 
nature of fanout classes ensures that fanout trees will have physically distinct components. 

Figure 3.23: Randomized Maximal-Fanout (diagram from [CK92]) 

For each network, the yield probability of the network is plotted against the number of uniformly 
distributed random faults. Results for the three-stage and four-stage networks are shown 
in Figure 3.26. The expected number of faults that each network can tolerate is summarized in 
Table 3.4. 

wire_stage(s)    ▷ s = routing stage 
    prev_expansion ← ni × d^(s+1)                         ▷ maximum fanout to stage s due to dilation 
    prev_eq_class ← no × r^((S+1)-s)                      ▷ equivalence class size at stage s 
    prev_fanout_class ← prev_eq_class / prev_expansion    ▷ fanout class size at stage s 
    expansion ← ni × d^(s+2)                              ▷ maximum fanout to stage s + 1 due to dilation 
    eq_class ← no × r^((S+1)-(s+1))                       ▷ equivalence class size at stage s + 1 
    fanout_class ← eq_class / expansion                   ▷ fanout class size at stage s + 1 
    if (fanout_class > 1) 
        foreach fanout equivalence class in stage s 
            create (r × d) different output-port lists, 
                one for each output from a routing switch 
            ▷ each of these lists will contain prev_fanout_class ports 
            foreach output-port list identified, identify the fanout_class 
                routers in stage s + 1 to which these ports should be 
                connected; the inputs on these routers make up the 
                corresponding in-port list 
            use wire_eq_class to randomly interconnect each in-port list 
                to each corresponding output-port list 
    else 
        foreach equivalence class in stage s 
            create r different output-port lists, 
                one for each logically distinct output direction from a router 
            ▷ each of these lists will contain (prev_eq_class × d) ports 
            foreach of the output-port lists identified, identify the eq_class 
                routers in stage s + 1 to which the output list should be 
                connected; the inputs on these routers make up the 
                corresponding in-port list 
            use wire_eq_class to randomly interconnect each in-port list 
                to each corresponding output-port list 

This algorithm describes how to wire random, maximal-fanout networks using the random 
interwiring algorithm, wire_eq_class, shown in Figure 3.21 (boundary cases omitted for 
clarity). 

Figure 3.24: Pseudo-code for Random, Maximal-Fanout Interwiring 

Figure 3.25: 16 x 16 Randomized, Maximal-Fanout Network 

[Plots: (A) 3-Stage Network Completeness; (B) 4-Stage Network Completeness. Horizontal axis: Percent Failure.] 

The probability that a network with a given number of faults is complete for the randomly- 
interwired, path expansion, and random maximal fanout, 3-stage and 4-stage networks. (A) 
Each 3-stage network uses 48 radix-4 components to interconnect 64 endpoints. (B) Each 
4-stage network uses 256 radix-4 components to interconnect 256 endpoints. 

Figure 3.26: Completeness of (A) 3-stage and (B) 4-stage Multipath Networks 

    Network                      Total     Test     Expected Failure Tolerated    Error Bound 
    Stages  Wiring               # Comp.   Trials   # Faults     % Network        # Faults 
    3       Random               48        1000     5.0          10%              0.063 
    3       Path Expansion       48        1000     8.1          16%              0.079 
    3       Random Max Fanout    48        1000     5.2          11%              0.060 
    4       Random               256       5000     11.8         4.6%             0.075 
    4       Path Expansion       256       5000     22.6         8.8%             0.130 
    4       Random Max Fanout    256       5000     12.5         4.9%             0.069 

The above table shows the expected number of faults each network can tolerate while 
remaining complete. Each network was fault tested as described in Section 3.5.4 for the 
indicated number of trials. 

Table 3.4: Fault Tolerance of Multipath Networks 

Wiring Extra-Stage Networks for Fault Tolerance 

It is worth noting that we can achieve the same fault tolerance as indicated in this section 
without using dilated routers. Consider replacing each of the dilated routers used in the networks 
above with an equivalently sized (i.e. same number of inputs, i, and same number of outputs, o) 
dilation-1 router (i.e. r = o, d = 1). The network we end up with is an extra-stage network since 
we have increased the radix while leaving the number of stages the same. From a fault tolerance 
perspective, this resulting extra-stage network has the same yield probability as the corresponding 
dilated network. As a result, the network wiring issues introduced in Section 3.5.3 apply equally 
well to extra-stage, multistage networks as they did to dilated, multistage networks. 

Performance Degradation in the Presence of Faults 

We are also interested in knowing how robust the network performance is when faults accumu- 
late. To that end, we consider a simple synthetic benchmark on the complete networks at various 
fault levels. This gives us some idea of the effects of congestion in the network, as well as how the 
faults affect the overall performance of the network. The routing protocol detailed in Chapter 4 is 
used for all of these simulations. 

Our synthetic benchmark, FLAT24, was designed to be representative of a shared-memory 
application. FLAT24 uses 24-byte messages with a uniform traffic distribution. FLAT24 generates 
0.04 new messages per router cycle based on the assumption that the network is running at twice 
the clock rate of the processor and a data-cache miss rate of 15%. The application is assumed to 
barrier synchronize every 10,000 cycles, or every 400 messages. Modeling barrier synchronization 
exposes the effects of localized degradation. If a small number of nodes have significantly fewer 
paths through the network than the rest of the nodes, the nodes with less connectivity will fall 
behind those with more. In a real application, these nodes will tend to hold up the remainder of the 
application since they are not progressing as rapidly as the rest of the nodes in the network. The 
periodic barrier synchronization is a simple and pessimistic way of limiting the extent to which 
nodes may get ahead of each other, and hence of exposing the effects of this localized degradation. 
This synthetic application and the simulations in general are described in detail in [Cho92]; most 
relevant details are reprinted in Appendix A. 

Comparative I/O bandwidth utilization and latencies for 3-stage and 4-stage random and 
path expansion networks on FLAT24. Recall from Table 3.4 that expected percentages of 
failure tolerated by random and deterministic networks are, respectively: 10% and 16% for 
3-stage networks; and 4.6% and 8.8% for 4-stage networks. Note that the performance degradation appears to 
level off because only complete networks are measured. Although the surviving networks 
suffer less degradation as percentage of failure increases, the number of surviving networks 
is becoming substantially smaller. 

Figure 3.27: Comparative Performance of 3-Stage and 4-Stage Networks 

Figure 3.27 shows the performance degradation of FLAT24 on the surviving networks at various 
fault levels. Here latency is the average time from when a message is injected into the network until 
the time its reply and acknowledgment are received. I/O bandwidth utilization measures the average 
fraction of network outputs which are receiving or replying to successful message transmissions at 
any point in time. This provides a measure of the useful bandwidth provided by the network.

1. Begin with all nodes live.

2. Determine the I/O-isolated nodes and remove them from the set of live nodes.

3. Each faulty chip leading to at least one live node is declared to be blocked.
Propagate blockages from the outputs to the inputs according to
the definition of blocking given below.

4. If all of a node's connections into the first stage of the network
lead to blocked chips, remove the node from the set of live nodes.



This algorithm harvests the nodes in a network which retain good connectivity in the
presence of faults. The algorithm will sacrifice nodes which still retain weak connectivity 
in order to maximize the performance of the harvested network. 

> A router is said to be blocked if it does not have at least one unused, operational output 
port in each logical direction which leads to a router which is not blocked. 

> An I/O-isolated node is a node which has lost all of its input connections to the first stage 
of the network or all of its output connections from the final stage of the network. 

Figure 3.28: Chong's Fault-Propagation Algorithm for Reconfiguration 

3.5.5 Network Harvest Evaluation 

To evaluate the harvest rate of a network with faults, we use the reconfiguration algorithm 
suggested by Chong in [CK92]. This reconfiguration algorithm identifies all nodes with "good" 
network connectivity. The algorithm does not necessarily identify all nodes which retain full 
connectivity in the network as available in the harvested network. Since it is the overall system 
performance that matters, not simply the number of nodes available for computation, Chong ob- 
serves that better overall performance is achieved when nodes with low bandwidth into the network 
are eliminated from the set of nodes used for computation. Chong's algorithm is summarized in 
Figure 3.28. 
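
As a rough illustration, the steps of Figure 3.28 can be sketched in Python. The data layout (`stages`, `outputs`, `node_in`, `node_out`), the function name `harvest`, and the toy six-router network are assumptions made for this sketch and do not come from [CK92]. The sketch also treats every operational port as unused, considers only router (chip) faults, and makes a single propagation pass.

```python
def harvest(stages, outputs, faulty, node_in, node_out):
    """Return the nodes kept live by one pass of fault propagation.

    stages   : list of router-id lists, first network stage first
    outputs  : router id -> one list per logical direction, each holding
               the downstream ids (routers, or node ids in the last stage)
    faulty   : set of faulty router ids
    node_in  : node id -> first-stage routers the node injects into
    node_out : node id -> final-stage routers that deliver to the node
    """
    # Steps 1 and 2: start with all nodes live, then drop I/O-isolated nodes.
    live = {n for n in node_in
            if any(r not in faulty for r in node_in[n])
            and any(r not in faulty for r in node_out[n])}

    # Nodes reachable from each router, computed from the last stage back.
    reach = {}
    for stage in reversed(stages):
        for r in stage:
            reach[r] = set()
            for direction in outputs[r]:
                for t in direction:
                    reach[r] |= reach.get(t, {t})

    # Step 3: faulty routers that lead to a live node are blocked ...
    blocked = {r for r in faulty if reach[r] & live}
    # ... and blockage propagates from the outputs toward the inputs.
    for stage in reversed(stages):
        for r in stage:
            if r not in blocked and any(
                    all(t in blocked for t in direction)
                    for direction in outputs[r]):
                blocked.add(r)

    # Step 4: drop nodes whose first-stage connections are all blocked.
    return {n for n in live if any(r not in blocked for r in node_in[n])}


# Hypothetical 4-node, 2-stage toy network (routers are strings, nodes ints).
stages = [["A", "B"], ["C", "D", "E", "F"]]
outputs = {"A": [["C"], ["E"]], "B": [["D"], ["F"]],
           "C": [[0], [1]], "D": [[0], [1]],
           "E": [[2], [3]], "F": [[2], [3]]}
node_in = {0: ["A"], 1: ["A", "B"], 2: ["B"], 3: ["A", "B"]}
node_out = {0: ["C", "D"], 1: ["C", "D"], 2: ["E", "F"], 3: ["E", "F"]}

print(sorted(harvest(stages, outputs, {"C"}, node_in, node_out)))  # [1, 2, 3]
```

In this toy case a single fault in router C blocks router A, which loses one logical direction. Node 0 is sacrificed because A is its only first-stage connection.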

Figure 3.29 shows the harvest rate for a 5-stage, radix-4, dilation-2 (1024 node) network. Also 
shown is the degradation in application performance assuming that the application can be efficiently 
repartitioned to run on the surviving processors. 

3.5.6 Trees 

Fat-trees have the same basic multipath, multistage structure as the multistage networks de- 
scribed so far in this section. It is easiest to think of each fat-tree network as two sub-networks. 
One sub-network routes from the root of the tree down to the leaves. This portion looks almost 
identical to the routing performed by the multistage networks that have been discussed. Particularly, 
this downward routing network performs the same recursive subdivision of possible destinations 
at each successive routing stage. The other sub-network allows connections to be routed up to 
the appropriate intermediate tree level and then cross over into the down routing sub-network. 
In fact, we could think of the flat, multistage networks as a tree which had a degenerate up and 
crossover sub-network. In these networks, the up network is simply a set of wires which connect all

Figure (A) shows the percentage of node loss under the criterion of fault-propagation.
Figure (B) compares the performance of the randomly wired multibutterfly with the randomized
maximal fanout network.



Figure 3.29: Fault-Propagation Node Loss and Performance for 1024-Node Systems (from [CK92]) 

network input connections directly into the root of the tree. It is the upward routing portion of the 
tree networks which gives them their ability to exploit locality. Two nodes close to each other can
cross over low in the tree structure and avoid traversing a large number of routers or consuming 
bandwidth near the root of the tree. 

Fat-Trees 

Fat-trees are distinguished from arbitrary tree based networks in that the interconnection band- 
width increases towards the root of the tree. The internal tree connections closer to the root require 
more bandwidth because they service a larger number of nodes below them. For instance, in a 
binary fat-tree the root of the tree will see all traffic that is not constrained solely to either half of 
the machine. The property that makes fat-tree structures most attractive is their universality
property. Leiserson shows that, when the rate of bandwidth growth in the fat-tree is chosen properly,
fat-trees can be volume universal. That is, a properly constructed volume $\Theta(V)$ fat-tree network
can simulate any volume $\Theta(V)$ network in polylogarithmic time [Lei85] [Lei89] [GL85].

The key observation in demonstrating the universality of various fat-tree structures is that the
physical world places constraints on the ratio between the volume of a region and the wire channel 
capacity, and hence bandwidth, which can efficiently enter or leave that volume. The channel 
capacity into a volume is limited by the surface area surrounding that volume. As we scale up 
to larger systems and hence larger volumes, the surface area of a given volume, V, grows only
as $\Theta(V^{2/3})$. To remain volume efficient, the channel capacity can only grow as $\Theta(V^{2/3})$. If the
channel capacity grows faster than this, then the size of the system packaging is limited by the 
channel capacity between regions rather than the volume of the system being packaged. As the 
system becomes large, pieces of the system must be placed further apart due to the interconnection 
bandwidth constraints. As a result, the universality property will not hold because the number of
processors per unit volume is decreasing as the system increases in size. If the channel capacity
grows slower than this, the universality property does not hold due to insufficient channel capacity
to support the potential message traffic. For binary fat-trees Leiserson shows that channel capacity
should increase as $\sqrt[3]{4}$ per stage toward the root of the tree in order to achieve volume universality.
Fat-trees also have considerable flexibility. When other pragmatic issues dictate a structure 
that allows more channel capacity, and consequently more bandwidth, at higher tree levels than 
is appropriate for volume-universality, the basic fat-tree structure can accommodate the increased 
interstage capacity. These additional channels will allow additional fault tolerance and lower the
network's contention latency, 7. 
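
The per-stage growth factor for a binary fat-tree can be checked from the surface-area argument. As a short worked step, assume (for illustration only) that each level toward the root encloses twice the volume of the level below it, and write $C(V)$ for the channel capacity into a region of volume $V$ (notation introduced for this sketch). With $C(V) \propto V^{2/3}$:

```latex
\[
\frac{C(2V)}{C(V)} \;=\; \frac{(2V)^{2/3}}{V^{2/3}} \;=\; 2^{2/3} \;=\; \sqrt[3]{4} \;\approx\; 1.59
\]
```

Under these assumptions, doubling the number of nodes served increases the channel capacity by only about 59%, which is why bandwidth per node thins out toward the root of a volume-universal fat-tree.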

Building Fat-Trees 

We can build fat-tree networks with the same fixed-size, dilated routers which we have used to 
construct flat, multistage networks. The use of such routers in the down sub-network is obvious 
since the down sub-network performs the same sorting function as in the flat networks. Here, 
the router radix defines the arity of the fat-tree. The up routing sub-network needs to expand the 
possible destinations so that a given route may make use of a large portion of the bandwidth at 
some higher tree stage. The up routing sub-network also needs to provide switching which allows 
periodic crossover to the down routing network. At the same time, the bandwidth between tree 
levels needs to be controlled to match the application requirements as described in the previous 
section. Just as with the flat-multistage networks, the endpoint connections are weak links and one 
generally wants to organize networks with multiple network connections per endpoint. Similarly, 
the issues of wiring the internal stages for fanout apply equally well here. 

As an example, consider building a fat-tree using radix-4, dilation-2 routing components. The 
down sub-network is a quaternary tree. In the up sub-network, we use the routing components to
switch between upward routing and crossover connections into the down sub-network. We can take 
advantage of the radix-4 switching provided by the routing component to route to several crossover 
connections at a single switching stage. As a result, we effectively create short-cut paths in the up 
routing tree. Figure 3.30 shows how a radix-4 up router can switch to three successive tree-stages 
and provide an upward connection in the tree. Since each up router in the up sub-tree services three
down-tree stages, the route to the root is only $\frac{1}{3}\log_4 N$ long. Figures 3.31 and 3.32 show the
logical connectivity for the up and down sub-trees using the short-cut crossover scheme shown in 
Figure 3.30. 
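
As a worked example of this path-length saving, take a quaternary fat-tree over $N = 4096$ endpoints (a machine size chosen here only for illustration). The down sub-tree has $\log_4 N$ stages, and each up router spans three of them:

```latex
\[
\log_4 N = \log_4 4096 = 6,
\qquad
\tfrac{1}{3}\log_4 N = \tfrac{6}{3} = 2
\]
```

A route that must climb all the way to the root therefore passes through two up routers before crossing over, rather than six.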

3.5.7 Hybrid Fat-Tree Networks 

Fat-trees allow us to exploit a considerable amount of locality at the expense of lengthening the 
paths between some processors. Flat, multistage networks fall at the opposite extreme of the locality 
spectrum where all nodes are uniformly close or distant. Another interesting structure to consider is
a hybrid fat-tree. A hybrid fat-tree is a compromise between the close uniform connections in the 
flat, multistage network and the locality and scalability of the fat-tree network. In a hybrid fat-tree, 
the main tree structure is constructed exactly as described in the previous section. However, the 
leaves of the hybrid fat-tree are themselves small multibutterfly style networks instead of individual
processing nodes. With small multibutterfly networks forming the leaves of the hybrid fat-tree,

[Figure 3.30 diagram. Port labels: (To next Up router); (From upstream Down routers); (From previous Up router) (or processing nodes); (To other down routers), repeated three times, the last also marked (or processing nodes).]

Shown above is a cross-sectional view of a fat-tree network showing a switching node in the 
up routing sub-tree and a down router in each of the three successive tree stages to which 
this up router can crossover. As shown, each router is a radix-4 routing component. Only 
a single output is shown in each logical direction for simplicity. With dilated routers, each 
dilated connection would be connected to different routers in the corresponding destination 
direction for fault tolerance. 

Figure 3.30: Cross-Sectional View of Up Routing Tree and Crossover 

small to moderate clusters of processors can efficiently work closely together while still retaining 
reasonable ability to communicate with the rest of the network. 

The flat, leaf portion of the network is composed of several stages of multibutterfly style 
switching. Each stage switches among r logical directions. The first stage is unique in that only 
(r — 1) of the r logical directions through the first stage route to routers in the next stage of the 
multibutterfly. The final logical direction through the first routing stage connects to the fat-tree 
network. The remaining stages in the leaf network perform routing purely within the leaf cluster. 
To allow connections into the leaf cluster from the fat-tree portion of the network, one r-th of the
inputs to the first routing stage come from the fat-tree network rather than from the leaf cluster 
processing nodes. Figure 3.33 shows a diagram of such a leaf cluster. This hybrid structure was 
introduced in [DeH90] and is developed in more detail there. 

3.6 Flexibility 

In Section 2.8, we raised some concerns about how well a network topology can be adapted 
to solve particular applications. Having reviewed the properties of these networks, we can answer 
many of the questions raised.

Figure 3.31: Connections in Down Routing Stages (left)
Figure 3.32: Up Routing Stage Connections with Lateral Crossovers (right)

Figure 3.33: Multibutterfly Style Cluster at Leaves of Fat-Tree

• How do we provide additional bandwidth for each node at a given level of semiconductor 
and packaging technology? 

If we assume that the semiconductor technology limits the interconnect speed, then we are 
trying to increase the bandwidth in an architectural way. With both flat multipath networks 
and multibutterflies, we can easily increase the bandwidth into a node by increasing the 
number of connections to and from the network (i.e. $n_i$ and $n_o$). This also has the side
effect of increasing the network fault tolerance.

• How do we get more/less fault tolerance for applications which have a higher/lower premium
for faults?

The simple answer here is to increase the number of connections to and from the network,
since this is the biggest limitation to fault tolerance. Using higher dilation routers will provide 
more potential for expansion and hence better fault-tolerance. Hybrid schemes which use 
extra-stages in a dilated network will also serve to increase the number of paths and hence 
the fault-tolerance of the network. 

• How do we build larger (smaller) machines? 

The scalability of the schemes presented here allows the same basic architecture to be used in
the construction of large or small machines. For very large machines, we saw that fat-trees or 
hybrid fat-trees are the best choice. For smaller machines, we saw that multistage networks 
may provide better performance. In between, the details of the technologies involved as well 
as other system requirements will determine where the crossover lies. 

• How can we decrease latency? At what costs?

We have control over the latency in several forms. The switching latency ($T_s$) is directly
controlled by the router radix, r. Increasing the radix of the router will lower the number
of stages which must be traversed and tend to decrease latency. However, the router radix
is limited by the pin limitations of the routing component. Increasing the radix will either
require an increase in die-size and package pin count (and hence cost), or a decrease in
dilation or data channel width. Decreasing dilation will tend to reduce fault-tolerance and
increase congestion. Decreasing the data channel width decreases the bandwidth and thus
increases both congestion and the message transmission time ($T_{transmit}$). By increasing the
channel width, we can decrease transmission time; again, this will either increase die-size
and cost, or require a decrease in radix or dilation. Finally, we can decrease congestion
by increasing router dilation or increasing the aggregate network bandwidth. Increasing the
dilation, again, must be traded off against radix, channel width, and cost. Increasing the
number of inputs and outputs to the network will increase the aggregate bandwidth of the 
network at the cost of more network resources. 
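
To make the radix versus stage-count trade-off concrete, the sketch below counts how many routing stages a flat multistage network needs so that radix^stages covers every destination. The helper name `stages_needed` is introduced here for illustration, and the calculation ignores dilation, channel width, and pin limits. The 1024-node, radix-4 case matches the 5-stage network of Figure 3.29.

```python
def stages_needed(num_nodes, radix):
    """Smallest stage count s such that radix**s >= num_nodes."""
    stages, reachable = 0, 1
    while reachable < num_nodes:
        reachable *= radix   # each stage subdivides destinations radix ways
        stages += 1
    return stages


for r in (2, 4, 8, 16):
    print(r, stages_needed(1024, r))
# 2 10
# 4 5
# 8 4
# 16 3
```

The diminishing returns are visible here: going from radix 2 to radix 4 halves the stage count, while going from radix 8 to radix 16 saves only one stage.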

3.7 Summary 

In this section, we have examined network topologies suitable for implementing robust,
low-latency interconnect for large-scale computing. We saw that express-cubes and fat-trees have the
best asymptotic characteristics in terms of latency and growth. We also saw how the multipath 
nature of these networks allows the potential for tolerating faults within the networks. For many 
networks, we see that architectures which tolerate network faults do not necessarily require additional
network latency. The only increase in network latency results from the lower bandwidth available in 
the faulty network. We examined detailed issues relevant to wiring multistage networks. We found 
that good performance results from wiring the network to avoid congestion and that randomized 
techniques provide the best strategy currently known for achieving such network wirings.

3.8 Areas to Explore

We have, by no means, explored all the issues associated with selecting the optimal network 
for every application. The following is a list of a few interesting areas of pursuit: 

1. It is hard to provide a final head-to-head latency comparison between networks without a good
quantification of the effects of congestion in various networks. As mentioned in Section 2.4, 
this is particularly difficult because the effects of congestion are highly dependent upon the 
network usage pattern needed by the application and the detailed network topology. A good 
quantification of congestion applicable across a wide range of networks and loading patterns 
would go a long way toward helping engineers design and evaluate routing networks.

2. In Section 3.5.4 we demonstrate that a class of extra-stage networks has the same fault-tolerant 
properties as dilated networks. These networks will generally have lower performance due 
to the necessity to make detailed routing decisions at the node rather than inside the network 
where the freedom can be used to minimize blocking. It would be worthwhile to quantify 
the magnitude of the performance improvement offered by the dilated routing components. 

3. Express-cubes have the same asymptotic network characteristics as fat-trees. We avoid 
detailed consideration of these networks at this point due to the difficulties associated with 
efficiently routing on such networks in the presence of faults. In the next chapter, we will 
show how to route effectively with faults for fat-tree and multistage networks. It would be 
interesting to see comparable routing solutions for express-cubes.

4. Routing Protocol



In the previous chapter, we saw how to construct multipath networks. The organization of these 
networks offers considerable potential for low-latency communication and fault-tolerant operation. 
To make use of this potential, we need a routing scheme which is capable of exploiting the multiple 
paths with low latency. In this chapter, we develop a suitable routing scheme and show how it 
meets these needs. 

4.1 Problem Statement 

As introduced in Chapter 1, we need a routing scheme which provides: 

1. Low-overhead routing

2. Protocol Flexibility 

3. Distributed routing 

4. Dynamic fault tolerance 

5. Fault identification and localization with minimal overhead 

4.1.1 Low-overhead Routing 

Any overhead associated with sending a message will increase end-to-end message latency. 
There are two primary forms of overhead which we wish to minimize: 

1. Overhead data

2. Overhead processing 

Overhead data includes message headers and trailers added to the message. Overhead data will 
diminish the available network bandwidth for conveying actual message data. Overhead processing 
includes the processing which must be done at each endpoint to interact with the network (e.g. $T_p$,
$T_w$) and the processing each router must perform to properly process each data stream (e.g. $t_{switch}$).
Endpoint overhead processing includes: 

1. processing necessary to prepare data for presentation to the network

2. processing necessary to use data arriving from the network 

3. processing necessary to control network operations 

We want a protocol that satisfies the various routing requirements with minimal overhead in terms 
of both processing time and transmitted data.

4.1.2 Flexibility

In the interest of providing general, reusable routing solutions, we seek a minimal protocol for 
reliable end-to-end message transport. Specific applications will need to use the network in many 
different ways. To allow as large a class of applications as possible the opportunity to use the 
network efficiently, the restrictions built into the underlying routing protocol should be minimized. 

4.1.3 Distributed Routing 

In the interest of fault tolerance, scalability, and high-speed operation, we want a distributed, 
self-routing protocol. A centralized arbiter would provide a potential single point of failure 
and have poor scalability characteristics. Rather, we need a routing scheme which can allocate 
routing resources and make connections efficiently in practice using only localized information. A 
distributed routing scheme operating on local information has the following beneficial properties: 

• faults only affect a small, localized area 

• routing decisions are simple and hence can be made quickly. 

4.1.4 Dynamic Fault Tolerance 

To provide continuous, reliable operation, the routing scheme must be capable of handling faults 
which arise at any point in time during operation. As introduced in Section 2.1.1, transient faults 
occur much more frequently than permanent faults. Additionally, for sufficiently long computations 
on any large machine, one or more components are likely to become faulty during the computation 
(e.g. the example presented in Section 2.5).

4.1.5 Fault Identification 

Although a routing protocol which can properly handle dynamic faults can tolerate unidentified
faults in the system, the performance of the routing protocol can be further improved by identifying 
the static faults and reconfiguring the network to avoid them. Fault identification also makes it 
possible to determine the extent of the faults in the system. This allows us to determine how close 
the system is to becoming inoperable. To the extent possible, the routing scheme should facilitate 
fault identification with low overhead. The faster that faults can be identified and the system 
reconfigured, the less impact the faults will have on network performance. 

4.2 Protocol Overview 

We have designed the METRO Routing Protocol (MRP) to address the issues raised in
Section 4.1. MRP is a synchronous protocol for circuit-switched, pipelined routing of word-wide data
through multipath, multistage networks constructed from crossbar routing components. MRP uses 
circuit switching to minimize the overhead associated with routing connections while facilitating 
tight time-bounded, end-to-end, source-responsible message delivery. MRP is composed of two 
parts: a router-to-router communication protocol, MRP-ROUTER, and a source-responsible node 
protocol, MRP-ENDPOINT.

In operation, an endpoint will feed a data stream of an arbitrary number of words into the
network at the rate of one word per clock cycle. The first few data words are treated as a 
routing specification and are used for path selection. Subsequent words are pipelined through 
the connection, if any, opened in response to the leading words. When the data stream ends, the 
endpoint may signal a request for the open connection to be reversed or dropped. When each router 
receives a reversal request from the sender, the router returns status and checksum information 
about the open connection to the source node. Once all routers in the path are reversed, data may 
flow back from the destination to the source. The connection may be reversed as many times as the 
source and destination desire before being closed. End-to-end checksums and acknowledgments 
ensure that data arrives intact at the destination endpoint. When a connection is blocked due to 
contention or a data stream is corrupted, the source endpoint retries the connection. 
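
The source-responsible retry behavior described above can be sketched as a small loop. This is an illustration only: the function name `send_with_retry`, the outcome strings, and the `max_tries` bound are hypothetical and are not part of MRP as described here. The `attempt` callable stands in for opening a connection, streaming the message, and interpreting the returned status, checksum, and acknowledgment.

```python
def send_with_retry(attempt, max_tries=8):
    """Retry a message until one attempt is acknowledged end to end.

    `attempt` is a caller-supplied function returning "ok", "blocked",
    or "corrupt" (hypothetical outcome names). Returns the number of
    tries used, or None if `max_tries` (an assumed bound) is exhausted.
    """
    for tries in range(1, max_tries + 1):
        if attempt() == "ok":
            return tries
        # Blocked or corrupted: the source simply tries again.
    return None


# Scripted outcomes: blocked once, corrupted once, then delivered.
outcomes = iter(["blocked", "corrupt", "ok"])
print(send_with_retry(lambda: next(outcomes)))  # 3
```

Keeping the retry decision at the source is what lets the routers stay simple: they report what happened and the endpoint decides what to do about it.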

4.3 MRP in the Context of the ISO OSI Reference Model

MRP fits into a layered protocol scheme, such as the ISO OSI Reference Model [DZ83], at the
data-link layer (see Figure 4.1). That is, MRP itself is independent of the underlying physical layer
which takes care of raw bit transmissions. MRP is thus independent of the electrical and mechanical
aspects of the interconnection. The protocol is applicable both in situations where the transit time 
between routers is less than the clock period and in situations where multiple data bits are pipelined 
over long wires (see Section 3.2). MRP provides mechanisms for controlling the transmission of
data packets and the direction of transmission over interconnection lines. It also provides sufficient 
information back to the source endpoint so the source can determine when a transmission succeeds 
and when retransmission is necessary. By leaving the retransmission of corrupted packets to the 
source, MRP allows the source endpoint to dictate the retransmission policy. As such, both the 
MRP-ROUTER and MRP-ENDPOINT are required to completely fulfill the role of the data-link layer. 
Since MRP provides dynamic self-routing, the protocol layer identified as the network layer by the
ISO OSI model is also provided by MRP. 

MRP itself is connection oriented, though there is no need for higher-level protocols to be 
connection oriented. Together, MRP-ROUTER and MRP-ENDPOINT provide a reliable, byte-stream 
connection from end-to-end through the routing network. 

4.4 Terminology 

Recall from Chapter 3 that a crossbar has a set of inputs and a set of outputs and can connect 
any of the inputs to any of the outputs with the restriction that only one input can be connected to 
each output at any point in time. A dilated crossbar has groups of outputs which are considered 
equivalent. We refer to the number of outputs which are equivalent in a particular logical direction 
as the crossbar's dilation, d. We refer to the number of logically distinct outputs which the crossbar 
can switch among as its radix, r. 

A circuit-switched routing component establishes connections between its input and output 
ports and forwards the data between inputs and outputs in a deterministic amount of time. Notably, 
there is no storage of the transmitted data inside the routing component. In a network of circuit- 
switched routing components, a path from the source to the destination is locked down during

MRP fits into the ISO OSI Reference Model at the data-link layer. The routers in a multipath
network use MRP-ROUTER to transfer data through the network. Each endpoint uses
MRP-ENDPOINT to facilitate end-to-end data transfers.



Figure 4.1: METRO Routing Protocol in the context of the ISO OSI Reference Model 

the connection; the resources along the established path are not available for other connections 
during the time the connection is established. In a network of pipelined, circuit-switched routing
components, all the routing components run synchronously from a central clock, and data takes a
deterministic number of clock cycles to pass through each routing component.

A crossbar is said to be self-routing if it can establish connections through itself based on 
signalling on its input channels. That is, rather than some external entity setting the crosspoint 
configuration, the router configures itself in response to requests which arrive via the input channels. 
A router is said to handle dynamic message traffic when it can open and close connections as 
messages arrive independently from one another at the input ports. 

When connections are requested through a router, there is no guarantee that the connections 
can be made. As long as the dilation of the router is smaller than the number of input channels into 
a router (i.e. d < i), it is possible that more connections will want to connect in a given logical 
direction than there are logically equivalent outputs. When this happens, some of the connections 
must be denied. When a connection request is rejected for this reason, it is said to be blocked. The 
data from a blocked connection is discarded and the source is informed that the connection was not

[Figure 4.2 diagram: Crosspoint Array]

The basic router has i forward ports and o = r • d backward ports. Any forward port can be
connected through the crosspoint array to any backward port. The arrows indicate the initial
direction of data flow.

Figure 4.2: Basic Router Configuration 

established. 
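
The blocking behavior of a dilated crossbar can be illustrated with a toy model. The class name `DilatedCrossbar`, the port numbering (direction × d + k), and the lowest-free-port selection rule are assumptions made for this sketch; the text above does not specify how a router chooses among equivalent outputs.

```python
class DilatedCrossbar:
    """Toy model of the backward ports of a radix-r, dilation-d crossbar."""

    def __init__(self, radix, dilation):
        self.radix, self.dilation = radix, dilation
        self.busy = [False] * (radix * dilation)   # o = r * d backward ports

    def route(self, direction):
        """Allocate a free backward port in `direction`; None means blocked."""
        first = direction * self.dilation
        for port in range(first, first + self.dilation):
            if not self.busy[port]:
                self.busy[port] = True
                return port
        return None

    def drop(self, port):
        """Release a backward port when its connection closes."""
        self.busy[port] = False


xbar = DilatedCrossbar(radix=4, dilation=2)
print(xbar.route(1), xbar.route(1), xbar.route(1))  # 2 3 None
```

With dilation 2, the first two requests for logical direction 1 succeed on equivalent ports and the third is blocked, which is the situation in which the source would be told the connection was not established.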

Once a connection is established through a crossbar, it can be turned. That is, the direction of 
data transmission can be reversed so that data flows from the original destination to the original 
source. This capability is useful for providing rapid replies between two nodes and is important 
in effecting reliable communications. MRP provides half-duplex, bidirectional data transmission 
since it can send data in both directions, but only in one direction at a time. When data is flowing 
between two routers, we call the router sending data the upstream router and the router receiving 
data the downstream router. 

Since connections can be turned around and data may flow in either direction through the 
crossbar router, it is confusing to distinguish input and output ports since any port can serve as 
either an input or an output. Instead, we will consider a set of forward ports and a set of backward 
ports. A forward port initiates a route and is initially an input port while a backward port is initially 
an output port. The basic topology for a crossbar router assumed throughout this chapter is shown 
in Figure 4.2.

The words sent over the network links can be classified as data words and signalling words.
The use of a single control bit which is separate from the transmitted data bits allows out of
band signalling to control the connection state. This table shows how control signals and
data are encoded. The control field is a designated $\log_2(\max(r, d))$ bit portion of the data
word.

Table 4.1: Control Word Encodings



4.5 Basic Router Protocol 



The behavior of MRP-ROUTER is based on the dialog between each backward port of each router 
and its companion forward port in the following stage of routers. In this section, we describe the 
core behavior of the router signalling protocol from the point of view of a single pair of routing 
components. 

4.5.1 Signalling 

Routing control signalling is performed over data transmission channels. Using simple state 
machines and one control bit, this signalling can occur out of band from the data. That is, the 
control signals are encoded outside of the space of data encodings. Out of band signalling allows 
the protocol to pass arbitrary data. Table 4.1 shows the encoding of various signals. The control field 
is a designated portion of the data word. Due to encoding requirements, it is at least $\log_2(\max(r, d))$
bits long. The remainder of this section explains how these control signals are used to effect routing 
control. 
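
As a quick check of the control-field bound, the helper below computes the minimum width for a given radix and dilation. The function name `control_field_bits` is introduced here for illustration, and rounding up to a whole number of bits when max(r, d) is not a power of two is an assumption of this sketch.

```python
def control_field_bits(radix, dilation):
    """Minimum control-field width: ceil(log2(max(radix, dilation))) bits."""
    return (max(radix, dilation) - 1).bit_length()


print(control_field_bits(4, 2))  # 2
print(control_field_bits(8, 2))  # 3
```

For the radix-4, dilation-2 routing components used as examples in the previous chapter, two bits of each word are therefore enough for the control field.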

4.5.2 Connection States 

The states of a forward-backward port pair can be described by a simple finite state machine. 
Figure 4.3 shows a minimal version of this state machine for the purpose of discussion. Each 
transition is labeled as: <event>/<result><dir>, where <event> is a logical expression, usually
including the reception of a particular kind of control word, <result> is an output resulting from the
reception, and <dir> is an arrow indicating the direction in which the <result> is sent. For instance,
the arrow from swallow to forward means that when a DATA word is received and the resulting
output direction specified is not blocked, forward the DATA out the allocated backward port in the
forward direction and change to the forward state. 



[Figure 4.3 state diagram. Legible state names: Status, Swallow. Legible transition labels: turn/status<-; drop; data*-blocked/data->; /checksum<-; data/data<-; drop/drop->.]

As a connection is opened, released, reversed, and used in the network, [...]
within the network go through a series of connection states as shown [...]
are initiated [...] a control word [...] by the local state
of the router. Each transition [...] <event>/<result><dir>. <event> describes
the control word received along with [...] <result> [...] resulting
from the event [...] an arrow indicating the direction [...] which the [...]

Figure 4.3: MRP-ROUTER Connection States



4.5.3 Router Behavior 

Idle port 

When a connection between routers is not in use, [...] in an
idle state, the backward port of one router transmits IDLE words to the corresponding forward port
in the next stage of the network. A forward port interprets the reception [...]
that it should remain in an idle state and hence should not attempt [...]

Route

To route a connection through a router, a ROUTE word is fed into a router forward port. [...]
forward port recognizes the transition of the control bit [...] zero [...]
the connection was idle, to a one. The router then uses the control [...]
routing direction. The router forwards the ROUTE word through a backward port in the [...]
direction, if one is available, and locks down the connection so that subsequent DATA words are
passed in the same direction. If no output is available, the connection [...]

Data 

All words with control bits set which [...] ROUTE word [...]
through the allocated backward port of the routing component to the [...]
in the network.

Data Completion

When all of the data words in a message have been passed into a r[...]
options. The sender can either drop [...]
direction of the connection around for a reply.

To drop the connection, [...] DROP word [...] DROP [...] causes
the router to close down an open connection and free up the output [...]
port is free [...] word, it forwards [...] DROP [...] to the next router then returns to a [...]

To turn a connection around, [...] TURN word [...] router receives [...]
TURN word, it forwards the turn out the allocated output port, if there [...]
status words. Figure 4.3 shows a version of the protocol which returns [...]
fill the pipeline delay associated with informing the subsequent router [...]
and getting backward data from the subsequent router. The filler word [...]
connection is sending data in the forward [...] direction when [...]

In the forward direction, the STATUS word [...] CHECKSUM word. If
the connection was not blocked during the routing cycle, after sending [...]
be receiving data in the reverse direction and will [...] data [...]
blocked, the router [...] DROP word [...] CHECKSUM and returns to the idle state.

In the backward direction, the router [...] DATA-IDLE [...] sending the
reverse data. In order for a port to be in the backward direction, there [...]
the router so there will always be data to propagate following a back[...]

Checksum and Status Information 

The STATUS and CHECKSUM words form a series of word-wide values which [...]
the source node of the integrity of the [...]
through the network. A STATUS word informs the source about which of th[...]
equivalent backward ports, if any, through which the connection is ro[uted ...]
arrives uncorrupted at the source endpoint, it allows the source to [...]
which the connection was actually routed. When a connection is [...]
information serves to pinpoint where [...]. The remaining bits [...]
STATUS word, together with CHECKSUM [...] words, can be used to tran[smit ...]
longitudinal checksum back to the source. This checksum is genera[ted ...]
the connection [...] ROUTE [...] ROUTE word itself. When data is corrupted [...]
to faults in the network, the checksum provides the source endpoint [...]
identify the most likely corruption source.

4.5.4 Making Connections 

When a connection is opened through a router, there may or may not [...]
desired logical output direction. If there is no available output, the router [...]
bits associated with the data stream. When the connection is later turned, [...] STATUS
word returned by the routing node informs the source that the message [...]
When exactly one output in the desired direction is available, the router [...]
through that output. When multiple paths are available, the router sw[...]
appropriate backward port [...] available.

This random path selection is the key to making the protocol robust [...]
while avoiding the need for centralized information about the network [...]
protocol simple. When faults develop in the network, the source dete[cts ...]
or damaged connection by the acknowledgment from the destination [...]
resend the data. Since the routing components select [...] randomly amo[ng ...]
stage, it is highly likely that the retry connection will take an alternat[e ...]
avoiding the newly exposed fault. Source-responsible retry coupled [...]
selection guarantees that the source can eventually find a fault-free [...]
provided one exists. The random selection also frees the source from [...]
of the redundant paths provided by dilated components in the network [...]
equivalent available outputs is an extremely simple selection criterion [...]
can be implemented with little area [...]
state information not already contained on the individual routing co[mponent ...]
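
The selection rule described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the router's actual logic: the function name `select_backward_port`, the modeling of ports as (direction, index) pairs, and the `busy` set are all hypothetical.

```python
import random

def select_backward_port(direction, dilation, busy, rng=random):
    """Pick a free backward port in the requested logical direction.

    Hypothetical model: a dilated router offers `dilation` equivalent
    ports per direction, named (direction, index). Returns None when
    every equivalent port is busy, i.e. the connection is blocked.
    """
    free = [(direction, i) for i in range(dilation)
            if (direction, i) not in busy]
    if not free:
        return None           # no output available: the connection blocks
    return rng.choice(free)   # one or more free: choose uniformly at random

# Usage: with port (1, 0) already allocated, only (1, 1) can be chosen.
print(select_backward_port(1, 2, {(1, 0)}))
```

Because the choice needs only the local busy state, a retried connection that meets the same router may well leave through a different equivalent port.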

4.6 Network Routing 

Each router in a path through the network needs to see a different [...]
we require the routing specification to [...] ROUTE word [...] that
implementation of the protocol, the [...]
Between routing stages, the bits of the datapath can be permuted so that [...]
see distinct control fields. This bit reordering allows a single routing [...]
through several routers. However, if the network is sufficiently large, [...]
routing word will eventually be exhausted before the full route through [...]
To deal with this, [...] allows routing switches to be configured to ignore th[...]
in an incoming message [...] configuration bit. This option allows net[works ...]
be arbitrarily large. Every time [...] exhausted, the [...] ROUTE
word can be discarded, allowing routing to continue with the fresh [...]
word. In this manner, the [...] DATA word [...] the message following the original ROUTE [...]
promoted to ROUTE word after the original is exhausted.
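
A small sketch may help make the routing-word mechanism concrete. It assumes a simplified model that is not taken from the source: each router consumes the low `bits_per_router` bits of the current ROUTE word, the inter-stage bit permutation is modeled as a right shift, and the router that follows an exhausted word is the one configured to swallow it. The name `directions_seen` is hypothetical.

```python
def directions_seen(route_words, word_bits, bits_per_router, stages):
    """Return the routing field each successive router would consume.

    Assumed model: after word_bits // bits_per_router routers a ROUTE
    word is exhausted; the next router swallows it, so the following
    word in the message is promoted to be the new ROUTE word.
    """
    per_word = word_bits // bits_per_router
    mask = (1 << bits_per_router) - 1
    words = list(route_words)
    seen = []
    used = 0
    for _ in range(stages):
        if used == per_word:       # swallow the exhausted ROUTE word
            words.pop(0)
            used = 0
        seen.append(words[0] & mask)
        words[0] >>= bits_per_router   # stand-in for the datapath permutation
        used += 1
    return seen

# Two 4-bit routing words, 2 bits per router, 4 stages.
print(directions_seen([0b1101, 0b0010], 4, 2, 4))  # [1, 3, 2, 0]
```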

4.7 Basic Endpoint Protocol 

Network endpoints [...] MRP-ENDPOINT [...] to guarantee the delivery of at least one uncorrupted [...]
copy of each message stream to the [...] destination [...] the source
of a connection [...] network input, while a receiving [...] network [...] called a
output. At the most primitive level, each network input behaves like a router backward port [...]
each network output behaves like a router forward port. The contro[l ...]
output, however, is more involved than the simple data stream handl[ing ...]
backward ports as described in Section 4.5.

4.7.1 Initiating a Connection 

When a node [...] connection over the network, the network input [...]
header and message checksum to the data and send it into the ne[twork ...]
then followed with a DROP or TURN to indicate the disposition of the link follo[wing ...]
transmission. The initial message thus looks like:

(ROUTE)* o (DATA)* o (DATA_checksum)* o TURN
<OR>
(ROUTE)* o (DATA)* o (DATA_checksum)* o DROP

The ROUTE word or words specifies a path to the desired destination [...]
how the datapath and routers can be configured so that unique rout[ing ...]
routing component in the path between the source and the destination [...]
constructed accordingly.
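
The message layout above can be sketched as a tagged word stream. This is an illustrative framing helper only; the `Kind` enumeration, the `frame_message` name, and the use of (kind, value) tuples are assumptions made for the example, and the real control-field encoding is the one given in Table 4.1.

```python
from enum import Enum

class Kind(Enum):
    ROUTE = "ROUTE"
    DATA = "DATA"
    TURN = "TURN"
    DROP = "DROP"

def frame_message(route_words, payload, checksum_words, expect_reply):
    """Build (ROUTE)* o (DATA)* o (DATA_checksum)* o TURN|DROP."""
    stream = [(Kind.ROUTE, w) for w in route_words]
    stream += [(Kind.DATA, w) for w in payload]
    stream += [(Kind.DATA, w) for w in checksum_words]  # checksum travels as DATA
    # TURN asks for the connection to be reversed for a reply; DROP closes it.
    stream.append((Kind.TURN if expect_reply else Kind.DROP, None))
    return stream

for kind, value in frame_message([0b1101], [10, 20], [0x5A], expect_reply=True):
    print(kind.value, value)
```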

In general, a node [...]. In the same way that routers choose
among the available logically equivalent outputs, the node should c[hoose ...]
available network inputs. The benefits in terms of dynamic fault avo[idance ...]
the routers as discussed in Section 4.5.4. When [...]
specification which reach [...] the case in an extra-stage [...]
network (Section 3.3.4), [...] choose randomly among the available p[aths ...]
the network. This random selection avoids worst-case congestion o[...]
the network and gives the routing algorithm the property that it can [...]
network. In these extra-stage cases, we are simply moving the rando[m ...]
from inside the network to the originating endpoint.

Each message should be guarded with a checksum [...] message [...]
endpoint can identify when a message has been corrupted. The length [...]
chosen so that the probability of a corrupted message having a good c[hecksum ...]
for the intended application. The checksum should be constructed [...]
mistakenly delivered to an [...]
way to ensure this is to include the destination node number in the [...]
another would be to seed the checksum as if the first portion of the d[ata ...]
but not actually transmit the node number. It is not, in general, possib[le ...]
in the checksum and use them in place of the node number for assu[ring ...]
the correct destination. With extra-stage networks or tree networks, [...]
destination with many different route-path specifications. In such cas[es ...]
not unique to each destination, and some of those routing words will have be[en ...]
the message in the network (see Section 4.6) [...] destination [...]
If the stripped routing words were used in calculating the checksum [...]
no way of knowing what the stripped routing words were and hence [...]
computation.
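
The idea of seeding the checksum with the destination node number, without transmitting that number, can be illustrated as follows. The source does not specify a checksum function; CRC-32 from Python's `zlib` is substituted here purely for illustration, and the function names and the 4-byte node-number encoding are assumptions.

```python
import zlib

def seeded_checksum(dest_node, payload):
    """Checksum computed as if the node number preceded the payload."""
    seed = zlib.crc32(dest_node.to_bytes(4, "big"))
    return zlib.crc32(payload, seed)   # running CRC continues from the seed

def accept(my_node, payload, checksum):
    """A receiver verifies using its own node number as the seed."""
    return seeded_checksum(my_node, payload) == checksum

c = seeded_checksum(7, b"payload words")
print(accept(7, b"payload words", c))   # True: intended destination
print(accept(8, b"payload words", c))   # False: misdelivered message is rejected
```

Note that only the destination number enters the seed. Routing words are left out, consistent with the observation above that stripped routing words cannot be reconstructed by the receiver.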

If the sending node needs to guarantee that [...]
node, it must turn the network around after sending the data rather [...]
Unless the node turns the network and gets a reply from the destination [...]
endpoint will not know what happened to the message inside the ne[twork ...]
responsible for message retransmission in the case of network corru[ption ...]
the message for retransmission until [...].

4.7.2 Return Data from Network 

After sending a TURN into the network, the source [...] checksum
information from each router in the [...] simplified rou[ter ...]
protocol shown in Figure 4.3, the replies will look like:

(STATUS o CHECKSUM)^s o DROP
<OR>
(STATUS o CHECKSUM)^s o (DATA)* o (DATA_checksum)* o DROP
<OR>
(STATUS o CHECKSUM)^s o (DATA)* o (DATA_checksum)* o TURN

Here s is the number of stages in [...] the number of stages into the netw[ork ...]
connection was routed be[...]. In the case where a connection is bl[ocked ...]
the source will only receive this status information up to and inclu[ding ...]
blocking occurred. As noted in Section 4.5.3, the checksum and status [...]
source endpoint with information which allow it to localize the sou[rce ...]

It is important to note that the checksums coming back from the [...]
determine whether or not the destination [...] data. Since [...]
dynamic fault may arise at any point in time, a fault may, for example, o[ccur ...]
the message data was sent past the router but before the router pa[...]
case, a corrupted checksum could seem to appear from a router w[...]
corruption. For this reason, the checksums from the routers serve o[nly ...]
source endpoint must use information from the destination endpoint [...]
or not the destination received the data uncorrupted.

When a connection is completed through the [...] TURN [...] destination
node, the destination has the opportunity to reply. At the very least, thi[s ...]
data stream arrived uncorrupted. Depending on the application, the [...]
send reply data along with this acknowledgment. When the destination [...]
data should also be guarded with a checksum to protect against dyn[amic ...]
When the destination simply [...] acknowledgment encoding
should be chosen so that there is [...] can
be corrupted into a [...] acknowledgment. After its reply, the destination may e[ither ...]
down the network connection or turn the connection around for fu[rther ...]

4.7.3 Retransmission 

We may surmise that a connection has failed to transfer data succ[essfully ...]
following occur:

1. The path is blocked due to resource contention in the network [...]

2. The destination node indicates that data was corrupted upon [...]

3. The return data stream does not adhere to protocol expectations

In any of these events, the source must retransmit the data if it wish[es ...]
of uncorrupted data. The first event may occur when the network is c[...]
options indicate that there is a fault in the network. Blocking can a[lso ...]
kinds of network failure. The fault in the network could be transient [...]
permanent fault. Since the endpoint does not know which kind of [...]
single fault occurrence is not conclusive evidence that a particular [...]
Consequently, the node endpoint may wish to save away the reply data [...]
fault analysis.

When retrying the transmission, the source node has some freedom [...]
The source may choose to retry the same message or a message to a [...]
may choose to retry immediately or after a wait period. Which techn[ique ...]
the requirements of the application. If the application expects the m[essages ...]
delivered to their destinations in the order they were generated, then [...]
choose a different message. Since the path on a retransmission may b[e ...]
just taken, it may be beneficial to immediately retry the failed conne[ction ...]
was due to blocking. While much work has been done on backoff and [...]
systems ([...]), retransmission policies for this class of networks re[main ...]
research.

If the network continues to retain complete connectivity between [...]
the source will eventually be able to deliver its message to its desti[nation ...]
blocking occurs at some stage in the network, some message has be[en ...]
in the network. Thus, in order for a connection [...]
progressed [...]. Following this reasoning, as long as complete connect[ivity ...]
network, some connection [...] therefore, forward routing
is always being made. If all connections are treated equally within th[e ...]
chance of being routed through the network. Thus, we can expect th[at ...]
will eventually complete as long as a path exists to the desired endpo[int ...]

However, if the network has lost full connectivity due to newly arisin[g ...]
may no longer be reachable. [...] there is a large amount of contentio[n ...]
destinations, the number of attempts necessary to deliver a message [...]
pragmatic matter, we often limit the number of retries allowed. If a co[nnection ...]
in a fixed number of trials, the MRP-ENDPOINT [...] information back
to the node. At this point, the node may wish to communicate wit[h ...]
the source and nature of its problem. [...]
verify if nodes have actually been newly disconnected from the netwo[rk ...]
from contention, then the node can take this opportunity to inform [...]
management protocols of excessive contention.
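
The bounded, source-responsible retry policy can be sketched as a short loop. The sketch assumes a caller-supplied `attempt` callable (hypothetical) that performs one connection attempt and reports one of the outcomes listed above; the outcome strings and the function name `send_with_retry` are invented for the example, and no backoff policy is modeled.

```python
def send_with_retry(attempt, max_retries):
    """Retry until the destination acknowledges or retries run out.

    `attempt()` returns "ok", "blocked", "corrupt", or "protocol_error".
    Failed outcomes are kept so they can be handed to fault analysis.
    """
    failures = []
    for _ in range(max_retries + 1):   # first try plus max_retries retries
        outcome = attempt()
        if outcome == "ok":
            return True, failures
        failures.append(outcome)       # a single failure is not conclusive
    return False, failures             # report back to the node

# Usage with a scripted sequence of outcomes.
script = iter(["blocked", "corrupt", "ok"])
print(send_with_retry(lambda: next(script), max_retries=5))
```

Since each attempt may take a different random path, the same message can be retried unchanged; only the outcome history differs between attempts.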

4.7.4 Receiving Data from Network 

To minimize end-to-end network latency, a system may begin proc[essing ...]
stream in parallel with the reception of the remainder of the data [...]
receiving data from the network, however, has no guarantee of the int[egrity ...]
receiving until it sees a checksum [...] processing the data [...]
as it arrives only as long as it can guarantee that the processing it does [...]
will have no adverse effects if the data is corrupted.

For example, consider a network operation which is intended to [...]
written into the [...] node's memory. If the only checksum was at the end of [...]
the destination node could [...] memory [...]
the address could be corrupted. In such a case, a corrupt address [...]
to write data over some arbitrary place in the node's memory. Simil[arly ...]
guarantee about the length of the [...] fault could cause the
to send what appears as more data than the original, uncorrupted m[essage ...]
would cause data in memory following the intended destination blo[ck ...]
these problems, the message data could start with the destination [...]
are guarded with their own checksum preceding the actual data tr[...]
the address and length are correctly received, the data may be stored [...]
corruption occurs in the data itself, the source will retransmit the d[ata ...]
which the memory is being maintained, it may be necessary [...]
memory being overwritten until the final verification of the integrity of [...]
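
The header-guarding scheme for a remote write can be sketched as below. The word layout `[addr, length, header_crc, data..., data_crc]`, the function names, and the use of CRC-32 as the checksum are all assumptions chosen for illustration; the source only states that address and length should carry their own checksum ahead of the data.

```python
import zlib

def _pack(words):
    return b"".join(w.to_bytes(4, "big") for w in words)

def make_write(addr, data):
    """Build a hypothetical guarded-write message."""
    return [addr, len(data), zlib.crc32(_pack([addr, len(data)])),
            *data, zlib.crc32(_pack(data))]

def receive_write(words, memory):
    """Store data as it arrives, but only after the header checks out."""
    addr, length, header_crc = words[0], words[1], words[2]
    if zlib.crc32(_pack([addr, length])) != header_crc:
        return "reject-header"        # memory is never touched
    if len(words) != 3 + length + 1:
        return "reject-length"        # stream does not match its header
    data = words[3:3 + length]
    for i, w in enumerate(data):      # safe: address and bounds are verified
        memory[addr + i] = w
    if zlib.crc32(_pack(data)) != words[-1]:
        return "reject-data"          # source retransmits and overwrites
    return "ok"

mem = {}
print(receive_write(make_write(16, [7, 8]), mem), mem)
```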

A node may receive a bad data stream for either of the following reasons:

1. Checksum(s) indicate message may be corrupted

2. Data stream does not adhere to protocol expectations

When this happens, [...] expected to indicate its rejection of the [...]
If the receiving [...] can give some indication of why the data stream was re[jected ...]
may be able to use that information when failure diagnosis is neces[sary ...]
piece of information the source node requires is the fact the connect[ion ...]

4.7.5 Idempotence 

In the introduction to this s[ection ...] MRP-ENDPOINT [...] guarantees the uncorrupted
delivery of at least one copy of the message to the destination. We migh[t ...]
the delivery of exactly one copy of the message. However, source-respo[nsible ...]
the potential for dynamic fault occurrences allows for multiple deli[veries ...]

Consider what happens when the source receives a corrupted che[cksum ...]
or other indications that something is wrong. It is safe to assume so[...]
not be clear where the problem occurred. Particularly, if the fault onl[y ...]
the destination, the source can believe that an operation failed wh[en ...]
successfully. When the source thinks the operation has been corru[pted ...]
nature of the protocol has the source retry the operation since ther[e ...]
operation has not been accepted by the destination. However, when [...]
in which only the return data or acknowledgment was corrupted, th[e ...]
data stream a second time.

The consequence is that all operations [...] MRP [...] must be idempotent. That is,
performing an operation multiple times [...] different results from performing [...]
once. Our previous example (Section 4.7.4) of a cross-network write op[eration ...]
since writing the same data twice will not change the data which i[s ...]
However, a network operation which caused a remote counter to be inc[remented ...]
idempotent since incrementing the counter more than once would [...]
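
The two examples above, a remote write and a remote counter increment, can be contrasted directly. In this sketch the destination's memory is modeled as a Python dictionary and duplicate delivery as applying the same operation more than once; all names are hypothetical.

```python
def write_op(addr, value):
    def op(mem):
        mem[addr] = value                    # same result however often applied
    return op

def increment_op(addr):
    def op(mem):
        mem[addr] = mem.get(addr, 0) + 1     # result depends on delivery count
    return op

def deliver(op, times):
    """Apply an operation `times` times to fresh memory, as a retry might."""
    mem = {}
    for _ in range(times):
        op(mem)
    return mem

print(deliver(write_op(4, 9), 1) == deliver(write_op(4, 9), 2))      # True
print(deliver(increment_op(4), 1) == deliver(increment_op(4), 2))    # False
```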

There are a few choices for dealing with the idempotence requirem[ent ...]
all operations [...] MRP [...] to be idempotent or we can implement a [...]
between appli[...] MRP which guarantees idempotent message delivery.

The Transmission Control Protocol (TCP), in use on many local-are[a ...]
"reliable" data streams by using sequence numbers [Pos81] to guaran[tee ...]
delivery. When a source needs to communicate with a destination, t[he ...]
destination for a valid set of sequence numbers. [...]
transmitted to the destination with a different sequence number. Th[...]
of all the sequence numbers it has seen so that exactly one copy of e[ach ...]
destination is passed along to higher-level protocols. In this manner [...]
source-responsible retransmission [...] idempote[nt ...]
the protocol level above TCP.

While one could implement a TCP-style unique sequence number pr[otocol ...]
a solution is inefficient for high-speed communications in a large-scal[e ...]
overhead in terms of the space and processing time required to track [...]
messages based on sequence numbers could easily become many ti[mes ...]
space required for basic message trans[...]
designing the lowest level communications primitives to be idempo[tent ...]
avoiding this cost.

4.8 Composite Behavior and Examples 

Having detailed [...] MRP [...] in the previous sections, this section reviews t[he ...]
behavior and shows several representative examples of protocol ope[ration ...]

Figure 4.4: 16 x 16 Multibutterfly Network

4.8.1 Composite Protocol Review

The source endpoint feeds messages into a router in the first stage o[f ...]
word is treated [...] ROUTE word. At each [...] either pipelined through the net[work ...]
or blocked by existing connections. Between router stages, the bits a[re ...]
routing bits to each router. When the bits are exhausted, the subseque[nt ...]
to swallow the ROUTE word and promote the [...] DATA word [...] ROUTE word. When
the entire message is fed into the network, [...] TURN word [...]
the connection. The first router [...] STATUS and CHECKSUM words [...] and forwards
the data received from the second router in the network. The first re[...]
router in the network will be STATUS and CHECKSUM words [...]. The source will,
thus, successively [...] STATUS-CHECKSUM word pairs from each [...]
connection is blocked at some point, [...] DROP [...] router will send
CHECKSUM pair; the DROP will close down the connection as it propagates ba[ck ...]
the connection was not blocked, the source will receive data from [...]
final STATUS-CHECKSUM pair. The connection may be reversed or closed by the [...]
has completed its reply.

4.8.2 Examples 

Consider the network shown in Figure 4.4. [...] from
highlighted. For the sake of simplicity in examples, let us consider th[e ...]
the paths between input 6 and output 16. The same protocol is obeye[d ...]
examples easily generalize to the complete network.

Each of the following examples (Figures 4.5 through 4.10) shows severa[l ...]
cations over one of the paths indicated in Figure 4.4. [...] routers is lab[eled ...]
with the control/data word transm[...]
flow. Often words of the same type are subscripted so the progression [...]
tracked from cycle to cycle.

Opening a Connection

Figures 4.5 and 4.6 show how a connection is opened through the ne[twork ...]
Figure 4.5 succeeds in [...] through the network, whereas the one [...]
blocked at the router in the third stage.

Shown above is a cycle-by-cycle progression of control and data throug[h ...]
a connection is successfully opened from one endpoint to another. [...]
through the network advancing one routing stage on each clock cycle. Th[e ...]
message is the routing word.

Figure 4.5: Successful Route through Network

Dropping a Connection

Figure 4.7 shows an open connection being dropped in the forwa[rd ...]
a connection from the reverse direction proceeds identically with [...]
destination reversed. If the connection [...] DROP [...] the router at w[hich ...]
the connection is blocked.

In the event that a connection [...] through some routing
component in the network, the message is discarded word-by-word at that [...]
shown above depicts such blocking [...] a router in the third stage of the ne[twork ...]

Figure 4.6: Connection Blocked in Network

When the transmitting [...] decides to terminate a message, it ends the [...]
with a DROP control word. [...] DROP follows the message through the network reset[ting ...]
each link in the connection to idle after traversing the link.

Figure 4.7: Dropping a Network Connection

Turning a Connection (Forward)

Figure 4.8 shows a successful connection being turned and backwa[rd ...]
the network. Figure 4.9 shows how a blocked connection is collapse[d ...]

When the source wishes to know the state of its connection and get [...]
destination, it [...] TURN [...] the network following the end of its forward tran[smission ...]
data. As TURN works its way through the network, the links it traverses ar[e ...]
In the pipeline delay required for the link to begin receiving data in the [...]
each router [...] sends status and checksum information to inform th[e ...]
connection state. After [...] TURN [...] propagated all the way through the network an[d ...]
routers along the connection have sent status and checksum words, [...]
along the connection.

Figure 4.8: Reversing an Open Network Connection

Turning a Connection (Reverse)

Turns from the reverse direction proceed basically [...] forward turn[s ...] STATUS
and CHECKSUM words, routers [...] DATA-IDLE words when turned from the reverse d[irection ...]
Figure 4.10 shows a turn from the reverse direction.

In the event that the connection was blocked at some router in the n[etwork ...]
router will be unable to provide a reverse connection through the net[work ...]
sending its own checksum and status words, [...] DROP [...] the blocked router will se[nd ...]
source. As DROP propagates back to the source, it resets the intervening net[work ...]

Figure 4.9: Reversing a Blocked Network Connection

4.9 Architectural Enhancements 

Beyond the basic routing strategy, there are several protocol enhanc[ements ...]
performance under certain cond[itions ...]

4.9.1 Avoiding Known Faults

As described so far, all router decisions are made purely by rando[m ...]
faults have occurred in the network and are known to exist, it would [...]
viewpoint, to deterministically avoid them. In extra-stage style netwo[rks ...]
the faulty path(s) from our list of potential paths. When randomly sele[cting ...]
only made from among the set of paths believed to be non-faulty. In di[lated ...]

A connection flowing in the backward direction can be reversed again so that data
may flow, once again, from the original source to the original destinatio[n ...]
similar to the original re[versal ...] DATA-IDLE [...] returned during the reversal
pipeline delay rather than checksum and status information.

Figure 4.10: Reverse Connection Turn

themselves are making the detailed path decisions. For dilated rout[ers ...]
router to avoid faulty links or routers.

Port deselection is one way to achieve this deterministic fault-avoidance [...]
components. That is, if we have a way to [...] backward port, we can
deterministically avoid ever traversing a known faulty link or attemp[ting ...]
router. The semantics of this deselection are such that the deselected [...]
it is always busy and hence removed from the set of potential backw[ard ...]
direction. When connections are routed through the router, the dese[lected ...]
never used.

For the same reasons, it is useful to be able to deselect forward po[rts ...]
to a faulty router or faulty interconnection link may see spurious dat[a ...]
interfering with the normal operation of the rest of the routing compo[nent ...]
port causes the forward port to ignore any connection requests it rec[eives ...]

A faulty link between routers is thus excised from the network by d[eselecting ...]
backward port pair attached to the link. A faulty routing componen[t ...]
from the network by deselecting all the backward and forward ports [...]
Chapter 5 addresses the issue of identifying fault sources. Chapter 5 a[lso ...]

Shown above is a connection blocking scenario where the successf[ul ...]
shown in bold and the blocked connections are shown in thick gray. Co[nnections ...]
successfully made between nodes 16 and 15 and between nodes 10 and [...]
connection between nodes 7 and 15 was blocked in the third stage si[nce ...]
no free output ports in the intended direction. Similarly, a connection [...]
nodes 1 and 15 failed due to blocking in the fourth stage. Each of these blo[cked ...]
consumes routing resources up to the stage in which blocking occurs [...]
non-blocked connections consume resources. New connections whi[ch ...]
these blocked connections continue to utilize network links can, in t[urn ...]
failed connections.

Figure 4.11: Blocked Paths in a Multibutterfly Network

for reconfiguration and considers network reconfiguration in more d[etail ...]

4.9.2 Back Drop 

As described so far, when a connection becomes blocked at [...] some
path from the source [...] remains open. The router links which the connec[tion ...]
up to stage [...] remain allocated to the blocked connection [...] DROP [...]
message works its way up to the blocked router. If the network is ver[y ...]
in later network stages will hold resources in many routing stages at [...]
Figure 4.11). Further, the length of time [...] held open will depend on [...]
of the [...] connection data. Longer data transmissions will exacerbate [...]
connection on the rest of the network. Since the blocked connectio[ns ...]
stage, they may in turn block connections at earlier stages.

To minimize the detrimental effects of a blocked connection on su[...]
can be given the ability to shut down [...] from the head of the data stre[am ...]
requires some way to propagate the information that the connection [...]

With fast path collapsing, a blocked connection is collapsed in a pipe[lined ...]
the point in the network where blocking occurs. The example shown [...]
a connection encounters blocking and is collapsed using a back dro[p ...]
routers in the connection closest to the blocked router are freed earl[iest ...]
to the source end of the connection.

Figure 4.12: Example of Fast Path Reclamation

open connection to the source. One simple way to achieve this is to a[dd ...]
between each pair of backward and forward ports inside the netw[ork ...]
input and [...] network output and their associated backward and forw[ard ...]
becomes blocked, the router which notes the block uses this ba[ck ...]
drop line to inform the upstream router that the connection is blocked [...]
nowhere. The upstream router may then deallocate the backward po[rt ...]
it for reuse, and pass along the blocking information to its own upstre[am ...]
manner, the connection may be collapsed in a pipelined fashion sta[rting ...]
and propagating back to the source. If the source endpoint is signalled [...]
line before it has finished sending the message, it too can abort the [...]
source may then begin to retry the connection. Figure 4.12 depicts how [...]
collapsed using the fast path reclamation.
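
A rough timing model shows why the back drop line frees resources sooner. The model below is an assumption made for illustration and is not taken from the source: the head word reaches stage s at cycle s, one word enters the network per cycle, the trailing DROP or TURN is word number `message_len`, and the back-drop signal moves one stage upstream per cycle. The name `release_cycles` is hypothetical.

```python
def release_cycles(blocked_stage, message_len, back_drop):
    """Cycle at which each router before the blocked stage frees its port."""
    release = {}
    for s in range(blocked_stage):
        if back_drop:
            # Block is noticed at cycle `blocked_stage`, then the back-drop
            # signal ripples upstream, freeing the nearest routers first.
            release[s] = blocked_stage + (blocked_stage - s)
        else:
            # Port stays allocated until the trailing word passes stage s.
            release[s] = message_len + s
    return release

print(release_cycles(3, 20, back_drop=True))    # {0: 6, 1: 5, 2: 4}
print(release_cycles(3, 20, back_drop=False))   # {0: 20, 1: 21, 2: 22}
```

Under these assumptions the holding time with back drop no longer depends on message length, and the routers closest to the blocking point are released earliest.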

Fast path collapsing has a number of positive effects on performa[nce ...]
resources which are not being used to transmit successful data. Sinc[e ...]
at the head of the message where blocking occurs, the routing resou[rces ...]
are freed more quickly than those in earlier network stages. This is f[...]
which block in later network stages tie up more resources in the ne[twork ...]
detrimental effect on the network than those which block in earli[er ...]
Figure 4.12 we see blocking occurring in the third stage. The router in t[he ...]
from transporting data in a few cycles. As a result, the total time th[...]
is occupied with the blocked connection is small. The router in th[e ...]
backward port carrying the blocked connection for only a couple of [...]

Fast path collapsing also benefits fault tolerance. Without some for[m ...]
mation, only the sending endpoint has the opportunity to [...] shut down [...]
endpoint must wait for the connection to be turned or dropped bef[ore ...]
output or network input carrying the data stream. If a fault occurs dur[ing ...]
a router to continually send [...]
The routing resources consumed from the faulty router up to and inclu[ding ...]
network input or output remain unusable as long as [...] persi[sts ...]
collapsing addresses this problem. If a connection continues to prov[ide ...]
the network endpoint expects the connection to turn or complete, [...]
conform to the expected protocol, [...] back drop line to i[...]
collapse of the path from the downstream end of the connection. O[...]
faulty router, which is continually sending data, will not affect the net[work ...]
router may continue to send data to its immediate neighbor. Howeve[r ...]
the associated backward port and will not forward the data anywhe[re ...]
a network endpoint can shut down a faulty connection stuck in the [...]

The disadvantage of fast path collapsing is that the source no longer [...]
information back from every router in a blocked path. The source will [...]
information back from every router for connections which are not bl[ocked ...]
when connections are blocked, the fast collapse does not allow the [...]
this detailed connection status. Since this information is of intere[st ...]
basic functionality, fast path collapsing [...] as a configuration optio[n ...]
be disabled and enabled by the testing system. Fast path collapsing w[...]
operation and disabled as necessary for fault diagnosis.

4.10 Performance 

The perform ance data presented i'.n. thi gup - ees vB621iasncdh 2L!|9Xewr e( r e 
gathered on dilated networks with fast path collapsing and fault masking.)
MRP allows the performance to degrade gracefully with the network a
network.
A router (or interconnect) may develop a fault such that it appears to alw
The example shown above depicts how backward path reclamation
routers in the path to be reclaimed at the request of the receiving e

Figure 4.13: Backward Reclamation of Connection Stuck Open

4.11 Pragmatic Variants 

There are a number of variants on the basic protocol that arise fro
erations. One primary consideration is the difference between the la
an operation and the frequency with which we can begin new oper
look at several points in the protocol where pipelining the transmiss
connection bandwidth since the latency involved may be greater tha
can be accepted.

4.11.1 Pipelining Data Through Routers 

If we can clock data between routing components faster than we
routing component, we may be able to achieve higher bandwidth b
m ultiple clock cycles to travp Dsneetrhte Tk> iiB tpra g Cioc nmlarlym akes sense w 
data can be clocked ata m ultipleofthe sw itc hpi n gleart A uPcpy eri rro iuigh the 
data through the routingsw itches JvrRffi-KOuTrER Iper bnt opcaoclt 6TIn ethoaie place 
where it does show up is when connections are reversed. Instead o
delaybefore return data is available (Se e Fi gttiioen4a3jl,ttfe eor e y* lid k tfceraenveaidyd 
additional pipeline stage through the routing switch. These addition
fact that it is necessary to flush the router's pipeline in the forward dir
To increase the network bandwidth, sometimes it makes sense to
components such that multiple clock cycles transpire between the time data ent
time data exits the routing component. Shown above is a connection
case where a single additional stage of pipelining is added to each rou

Figure 4.14: Example Connection Open with Pipelined Routers

direction before reverse data can be forwarded along. The addition
wasted since they can be used to send additional connection and re
the source endpoint. Alternatively, filler data, DATA-IDLE words, can be used to hold th
connection open during the pipeline delays. Figures 4.14 and 4.15 show
scenarios. These additional pipeline delay cycles occur when the c
direction.
When pipelining the data transfer through routing components, there
delay cycles when the direction of data transmission is reversed. The
to the need to flush the router pipeline in the forward direction and it
direction. Shown above is a connection turn when a single additional
added to each routing com

Figure 4.15: Example Turn with Pipelined Routers
4.11.2 Pipelined Connection Setup

The longest latency operation inside a routing component is often
when arbitration must occur for a backward port. If we require that
the same amount of time, or the same number of pipeline cycles, as
through the component, the latency associated with connection setu
the latency associated with data transfer through the router. Alterna
generally more pipeline cycles, for connection setup than for data tr
this, each router will consume a number of words equal to the differe
latency and the transmission pipeline latency from the head of each

Consider a router which can route data along an established path
three cycles to establish a new connection. The router would consu
head ofeach routingstream before btnwn ae sc aibolre a <m A sfd>arlwl asrhd a"hce rem a 
of the data out the allocated backward port. Figure 4.16 shows the da
establishment in this scenario.

When pipelined connection setup is used, there is no need for an
on each router since each router always consumes one or more wo
stream it routes. The only other accommodation needed for this kin
construct the header with the appropriate padding between routin
the sees the proper routing specification.
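
As a rough illustration of the header padding described here, the following Python sketch assumes that every router strips a fixed number of words from the head of each stream, equal to the setup latency minus the transfer latency in cycles. The names `build_header` and `route_through`, the `PAD` filler word, and the string word labels are hypothetical and serve only to make the bookkeeping concrete.

```python
PAD = "PAD"  # hypothetical filler word; the text does not name one


def words_consumed(setup_cycles, transfer_cycles):
    """Words a router strips from the head of each stream it routes."""
    return setup_cycles - transfer_cycles


def build_header(routing_words, setup_cycles, transfer_cycles):
    """Follow each routing word with enough padding for one router."""
    k = words_consumed(setup_cycles, transfer_cycles)
    if k < 1:
        raise ValueError("this sketch assumes each router consumes at least one word")
    header = []
    for word in routing_words:
        header.append(word)
        header.extend([PAD] * (k - 1))
    return header


def route_through(stream, stages, setup_cycles, transfer_cycles):
    """Model each stage consuming its share; return (routing words seen, remainder)."""
    k = words_consumed(setup_cycles, transfer_cycles)
    seen = []
    for _ in range(stages):
        seen.append(stream[0])   # the word this stage interprets as its routing word
        stream = stream[k:]      # words swallowed while the connection is set up
    return seen, stream


# One-cycle transfer, three-cycle setup: each router swallows two words.
header = build_header(["R0", "R1", "R2"], setup_cycles=3, transfer_cycles=1)
seen, payload = route_through(header + ["D0", "D1"], 3, 3, 1)
print(header)
print(seen, payload)
```

With these assumed latencies each routing word is followed by one padding word, so every stage sees its own routing word at the head of the stream and the payload arrives intact.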

4.11.3 Pipelining Bits on Wires 

In Se c t i o n 3.2, w e noted thatin mmmd'o f tehee nnsa dlw roartka n,the transitti 
the wires between routing switches often exceeds the rate at which n
section, we suggested that we could pipeline multiple bits on the w
from being limited by the length of long wires. In many ways, this tec
pipeliningdata through a routingcom p o4nlfe.ln Watshdpi $p n Isisiei d g,nv Se c t i o i 
can support a data rate which is higher than the transmission latenc
or across the wire. The effects of wire pipelining on the routing protoc
effects of pipelining on the routing protocol. Like the router pipeline,
each w ire pipeline m us titoiea Hid si heecdt im it ht<h em filled in the reverse direc 
backw ard data is returned. Aga i nrvfiW-HDLEdva <b ar ds ,urc h satsbt&i enserted into tl 
return stream to hold the connection open while the wire pipeline
shows a turn scenario when bits are being pipelined on the wires be

One important assumption is that the wire between two compo
number of pipeline registers. How one ensures that this assumpti
an important implementation detail. With a properly series termin
between routers, we do nothave to w ottfi ynagbt<bm tore fle tht eow a rae n cThsee 
w ire w ill look, for the m o st part, lei kesasi kny tff i-at k 1 iasy.t (Fhm ake the tim e-c 
approximate an integral number of clock cycles so that it does beha
registers. A brute-force method is to carefully control the length and
the wires between components. A more sophisticated method is to
drivers themselves to adjust the chip-to-chip delay sufficiently to me
In situations where data transfer can occur with significantly less laten
setup, it may make sense to pipeline the opening of the connection. T
shows the case where a connection cannot be established to forward
port to the backward port until two cycles after the initial route reques
the routing word and the word immediately following it are not forwa
router. Once the route is set up, data flows through the routers in the
fashion with no additional delay.

Figure 4.16: Example of Pipelined Connection Setup
Depicted above is a turn occurring in a 3-stage network where there is
long wire between stages two and three. Since it takes two clock cycles
wire in each direction, during a turn the second stage router must fill
with DATA-IDLE words while waiting for return data from the third stage rou
end of the long wire.

Figure 4.17: Example Turn with Wire Pipelining

Chapter 6 looks at techniques to support this assumption in further detail.

4.12 Width Cascading 

In Section 2.7.1 we noted that the number of i/o pins on an IC is limited
We also noted that the number of available i/o pins is growing slowl
Section 3.6 we pointed out that this limitation requires a trade-off bet
channels in the netw ork, the radpiocrorfnl ,cahn do tth td radgj 4 stitmco ru t>i ft g 
component. To first order, the number of i/o pins required for a square
d, and data ch ann edss(2iw' itd- tflfc Since the totalnum beroflCpins at a g 
technologyand costis afcjfecg # e cmn is feat u; to,, (a a a& a s : 



$$N_{pins} = 2 \cdot r \cdot w \cdot d \qquad (4.1)$$



Alternatively, we can consider how to break the function of a single
into multiple ICs. If we can split the function of a router across sever
beingable to build large r p rpm rita net sooirttiry^cd) rmg able to use sm allera
cheaper ICs in building our routing components. Of course, this segr
benefit if we can do it without incurring the cost of a large number of
constituent ICs.

Width cascading is a technique for building routing components with wid
components with narrow data channels. This allows us to build a lo
a large w idth,w ithoutdirectlysacra ncciaa igd irdg txca fitjii Haiila biro, w e 
can casmadffitingcom ponents to u^ a c C( hi e i w vb iac to/ god \tehr ned bythe follow in 
equations:

$$N_{pins} = 2 \cdot r \cdot w_{router} \cdot d \qquad (4.2)$$

$$w_{cascade} = c \cdot w_{router} \qquad (4.3)$$

Of course, equation 4.1 was just an approximation of the pin requireme
only to the extent that the overhead, in terms of pins interconnectin
compared to the total number of pins on the router.
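
The pin trade-off can be made concrete with a small Python sketch. It takes the first-order estimate of 2 * r * w * d pins for a square router of radix r, channel width w, and dilation d, and assumes that the logical width of a cascade is simply the per-router width times the number of routers cascaded. The function names and the example numbers are illustrative assumptions, and the extra pins a cascade needs for coordination are ignored.

```python
def router_pins(radix, width, dilation):
    """First-order i/o pin estimate for a square router: 2 * r * w * d."""
    return 2 * radix * width * dilation


def max_width(pin_budget, radix, dilation):
    """Widest data channel that fits a pin budget at a given radix and dilation."""
    return pin_budget // (2 * radix * dilation)


def cascade_width(num_routers, router_width):
    """Logical channel width obtained by cascading identical routers (assumed linear)."""
    return num_routers * router_width


pins = router_pins(radix=4, width=8, dilation=2)
narrow = max_width(pin_budget=pins, radix=8, dilation=2)
print(pins, narrow, cascade_width(4, narrow))
```

In this example, doubling the radix within the same pin budget halves the channel width of each component, and cascading four such components yields a logical channel wider than the original.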

4.12.1 Width Cascading Problem 

Width cascading exploits the basic idea that we can replicate the n
sponding data channels to achieve a wider data path. This replicati
only have the desired effect of behaving like a wider data network if t
two networks is identical. That is, the data launched at the same tim
arrive at the destination simultaneously or block at the same stage in
complicated by the following aspects:

1. Faults may occur affecting one copy of the network differently from

2. The techniques in use for selection among available, equivalent out

On a more basic level, then, the problem becomes that of ensuring
comprising a single, logical router handle data identically.

Our basic problem is to keep cascaded routers in mutually consistent states. That i
setofprim itive routingelem ents w hich ac tpaos nae snitn sgheo ho lgl ccaol rr n a <t itn 
their crossbars such that the same forward ports are transferring dat
Each crossbar router must, therefore, allocate connection requests i
following conditions hold, the routers will be in mutually consistent stat

1. each router sees the same connection requests

2. each router services the connection requests in the same way

3. each router turns or drops connections at the same time

The routers may see different connection requests if the routing sp
to some fault. Once the routers in the cascade see different connecti
channel, the routers may handle the route differently. If any of the ro
assignment of a backward port, the subsequent router stage will not
that is, the logical data channel will be split and parts of the data stre
routers causing these routers, in turn, not to see identical connectio

Non-dilated routers in the absence of faults will route identical co
directions, as long as the routers were already in a consistent state
connection requests. However, a fault inside the router IC may cause a
and misroute data or open or close connections in a manner incor
stream. Also, with faults the cascaded routers may see different conne
ifthe routers utilize their fij aiendeocilni otro <d an il tiei ffe rent channels in the sa 
direction, the subsequent stage will see split traffic.

With faults, it is possible that one portion of the data stream app
connection when another does not. This can lead to some of the ro
and reclaiming backward ports while others do not. The routers wi
since they do not all have the same set of backward ports available to
requests.

4.12.2 Techniques 

We can address the difficulties associated with width cascading by u

1. Identical control fields

2. Shared randomness

3. Wired-AND in-use indication on backward ports

4. End-to-end checksums across logical data channels

The first thing we need to ensure is that each router in a cascade wi
control fields in the absence of faults. This is easy to guarantee by sim
TURN, DROP, and DATA-IDLE values provided to the wide, logical connection i
from copies ofthe correspondingciot hver doI iwt ionrglesl fonr ehnet sp. rWen m u s t a 1 
constructthe w ide data suchptdiratfBtisth erso tih tei rs g am an va lues in its contr 
In the absence of faults and dilation, this will ensure that all routers see
and the same turn and drop indications.

We want a set of cascaded routers to make the same random decisi
connection re qu est. To satis fy this re quirem ent,w ffipihwdBtring orm o 
component for "random" data. This externalizes the randomness so t
routers in the cascade. We can then make sure that each router mak
of connections to available backward ports based on the connectio
va 1 u e s . We call this technique ofusingshar shdrqckmhdcmmeassL ffa n d o m i za t i o 
avoid the need for extra components to provide random bit stream
we can allow each routing component to generate one pseudo-random bit strea
bit streams can then be wired to random inputs as appropriate wh
network.
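
A minimal Python sketch of this idea follows. It assumes that every router in a cascade receives the same externally supplied random bits and applies the same deterministic rule to map those bits onto its set of free backward ports. The function name `select_backward_port` and the modulo-based rule are assumptions made for illustration; the text does not specify how the bits are mapped to ports.

```python
def select_backward_port(available_ports, random_bits):
    """Deterministically pick one free port from the shared random bits."""
    if not available_ports:
        return None  # blocked: no free port in the requested direction
    ordered = sorted(available_ports)
    # Interpret the shared bits as an unsigned integer (hypothetical rule).
    value = 0
    for bit in random_bits:
        value = (value << 1) | bit
    return ordered[value % len(ordered)]


shared_bits = [1, 0]  # stands in for the external random inputs
cascade = [{0, 1}, {1, 0}, {0, 1}, {0, 1}]  # free ports seen by four routers
choices = [select_backward_port(ports, shared_bits) for ports in cascade]
print(choices)
```

Routers that see the same free ports and the same random bits make the same choice. A router whose free-port set differs, for example after a fault, may choose differently, which is the inconsistency the in-use indication is meant to catch. The modulo rule is slightly biased when the number of free ports is not a power of two.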
Figure 4.18: Cascaded Router Configuration using Four Routing Elements

The critical piece of state information which needs to be mainta
among routers is which backward ports are being used by which fo
directly comparing this state among routers on every clock cycle wo
(i ■ o) b i t s am ongrouters each clock cy® hen. eNottiionngst ha x e rm ca sntyew o r d s 1 o 
and hence connection requests tend to occur infrequently, we can ap
simply comparing which output ports are actually in use on each cyc
imperfect in determining all possible faulty connections, but it will c
immediately. We can arrange for the router to recover from almost all

To make the comparison, we add a single extra pin per backward
the router to indicate when the backward port is being used. We t
correspondingbackw ard-portin-use pin sAMDfcc ca is figudr a di © ai u Wdires ri n a w 
a router begins to use a connection, it signals that the port is in use
data and monitors its in-use pin. The in-use pin will only be asserte
cascade agree that the port should be alloctatl erdg tlfrm ftetrhaenimpapsEO p r 
signal appears unasserted when a router has tried to allocate a back
it is in an inconsistent state from its peers, deallocates the backward
In this manner, the cascade will only differ in terms of backward p
transient amount of time when a fault occurs. In most cases, if the fa
be m isrouted,diffe rent backw ard port s^fflrve isld Ideectteecdt Ihrod fahuilst w ii rie d 
clear its effects.Figure 4.18show s hp own fonut b eos u it igit^JQ s> nndi b di da -r e d 
randomness scheme might be connected to form a cascaded router
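
The in-use comparison can be sketched in Python as follows. The model assumes one shared line per backward port that reads as asserted only when every router in the cascade drives it, and that a router releases a port it tried to allocate whenever it observes the line unasserted. The class `CascadeRouter` and the helper `settle` are hypothetical names; timing and electrical behavior are not modelled.

```python
def wired_and(signals):
    """A wired-AND line reads asserted only if every driver asserts it."""
    return all(signals)


class CascadeRouter:
    """Hypothetical model of one primitive routing element in a cascade."""

    def __init__(self, name):
        self.name = name
        self.allocated = set()  # backward ports this router believes it holds

    def try_allocate(self, port):
        self.allocated.add(port)

    def drives_in_use(self, port):
        return port in self.allocated

    def check_in_use(self, port, line_asserted):
        # Release the port if we tried to use it but the shared line is low.
        if port in self.allocated and not line_asserted:
            self.allocated.discard(port)


def settle(routers, ports):
    """Evaluate each shared line once and let every router react to it."""
    for port in ports:
        line = wired_and([r.drives_in_use(port) for r in routers])
        for r in routers:
            r.check_in_use(port, line)


routers = [CascadeRouter(f"r{i}") for i in range(4)]
for r in routers[:3]:
    r.try_allocate(0)
routers[3].try_allocate(1)  # a fault steers one slice to a different port
settle(routers, ports=[0, 1])
print([sorted(r.allocated) for r in routers])
```

In this example the disagreement clears every allocation, so the cascade returns to a consistent state after the transient. As the text goes on to note, this check is only an approximation and some faults can evade it.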

The onlytim e a fault w ill fool this appro x(bmmt icot moirsswahfE rb anirugltip 
opened simultaneously through the router. In this case it is possible
will be assigned during that same cycle. If the fault causes a set of m
connected to the same set of backward ports, but in a different con
wired-AND will mistakenly indicate that the connections are valid. As i
faults and connection requests occur infrequently, we do have some
occurs in some stage other than the final stage of the network, it is gen
connections will be going to different destinations. As a consequence
diverge in a subsequent network stage. When they do diverge, the wired-AND
connections. If we are using the fast path collapsing described in Sec
connection can be collapsed quickly from the point of divergence. Co
quickly recover from this inconsistent state.

However, if the fault occurs late in the network or the spliced conn
to the same destination, it is possible that a network output will se
the endpoint we can make use of the forward checksum to determi
at the endpoint is invalid. For this reason, the forward checksum in
be computed across the entire width of the logical data channel rat
streams.

4.12.3 Costs and Implementation Issues 

The first-order cost for supporting this width cascading is:

1. o additional i/o pins on the router (one for each backward port)

2. Several random input and, perhaps, one random output pins

Ext ernally, w e w illhave to place a pul kMDjp bnns e aarc <h w let hve i M hrae vt-to rout 
the shared random bits and the in -use ctctlni nrg)t i mi re eosn Bnh ge ridi e-ir sf(E rwt h ieesd( 
AND to be sufficiently small, the primitive routing elements in a cascade
each other. Si nee the state s h o miai rg bseitswdeoenne cwjAiMrhs It In ue cW li r e dt+i e 
num berofcom ponents supported bya case aldtey ria tlH mrittb dnb y nhye p h 
architectural features. One reasonable option for large, high-perform
cascade togetheras bare d bed oi rl ea (Se ai Sei£)ct Tibi p nrht ii-c h kpdm 1 e w ill 
allow the shared interconnect to be very short. Of course, if one is wi
to the cascade and dediasaadth irnopu ut e p f<hi rb i«l an 'n a ri gu s e signals,itis possib 
to avoid the electrical issues associated with the external wired-AND

4.12.4 Flexibility Benefits 

Width cascading allows us to improve the bandwidth without incr
com ponent.Width cascadingalso im p2^^Q n yft 8 -g, bryaari bemwi tup mil s Isesra gey, 
to be transferred to and from the network in fewer clock cycles. Addi
m ake narrow channelprpmri eirfet ar a liltoi w gienogm sto take advantage ofthe 
resources for increasing the radix or dilation. Increasing the radix wi
T s , w hile increasingthe dilation w ill d e ere a se contention and incree 

4.13 Protocol Features 

In Section 4.1 we identified five key features we hoped to design into
In this section, w e brie AyMffiTROeRw u It bnwg PItId 4o co 1 addresses these issues 
4.13.1 Overhead

MRP keeps overhead low by being minimal. The only overhead data
data through the network are:

1. Routing words which determine the destination

2. Forward checksum(s) used to ensure data integrity

The forward checksums are untouched by the routers in the netwo
endpoints to use checksums of the appropriate length to guarantee
corruptm essage acceptance forthe applicatiorti tThtas boj ifftei hegnwt ords; 
bits to fully specify the desired destination.

Routing words directly specify a desired output direction so that mi
to handle each connection request. Each routing decision is simple, all
occur with low overhead. End-to-end acknowledgments allow route
connections.

4.13.2 Flexibility 

Bybeingm inim aland MRPccca m hiei fete sdi J yadapted to a w ide range of 
level protocol requirements efficiently. The basic protocol is lightwei
detailed packet sem antics. Th is allow s applicatioppeirtaainst^m net 
to impose the packet structure which best suits them. Transmission
router protocol and a single connection may be reversed any numbe
efficient data transfers.

4.13.3 Distributed Routing 

MRP has the desired distributed routing property. No global knowle
through the network. The incorporated randomization even obviate
knowledge of faults. At the same time, local reconfiguration can be u
faults w ithoutrequirirtigt;y noy sri ra glre tahn a notion ofthe global state oftl 
Dilated routers allow the network to make more efficient detailed ro
information.

4.13.4 Fault Tolerance 

End-to-end checksums allow corrupted connections to be detect
coupled w ith end-to-end acknow le deganc h ndt a tm a kras sa me i fcasiu cttheast s fu 
transm itted to the desired destination at leadto arm icze t i (Fine i a oprm tbhi n a 
selection along with multipath networks allows the protocol to eve
w hich existsbetw een a source and destinat bonnnpe antri.oTh e Ik b hi i byd ds hs h i 
ends of a connection allows faulty connections to be collapsed by ei
features combine to guarantee that:

1. Any data corruption is detected



2. An yfailed or corrupted connection w id bbc e p ft iidi dbdy ah eist i irpaota otmve 1 ; 

3. As long as the network continues to completely connect all comm
can determine a fault-free path which allows any connection to

These features are guaranteed regardless of whether the fault is dyna
even when faults occur during an ongoing transmission. Further, sim
can be integrated into the protocol to make it possible to minimize
identified, static faults on network performance.

4.13.5 Fault Identification and Localization 

A combination of forward and backward checksums along with c
provide MRP with the ability to locate and detect faults. Forward checksum
through the network to make sure that no fault which affects data t
Backward checksums provide an estimate of where faults may hav
checksums provide an at-speed indication of the data integrity betw
inside the netw ork. Status indications help localizfci iagit ht §f r o u tin gc 
actual path taken by any route in the network. Status information a
pipeline reversal slots which have no data to transmit otherwise.
data adds no overhead to the routing protocol. Additionally, since e
actually verify the integrity of the received data, this backward status
normal operation.

4.14 Summary 

In this chapter, we have described a source-responsible, pipelin
pro to co 1 su itab le foruse w ith m ultipa tihorme t aa t>i okns, cWeu spal w dtfa/aithae n 
to-end message checksums and source-responsible retry led to a pro
occurring faults. We also saw that such a protocol could be realized w
decentralized control. With some simple optimization to the basic
performance can be achieved even when faults exist in the network
protocol was easily adapted to accommodate several pragmatic op
relative performance we can extract from our implementation techn

4.15 Areas to Explore 

This chapter has demonstrated the existence and details of a protoc
performance and dynamic fault-tolerance with low overhead. There
further exploration relative to optimizing such a protocol. Additiona
suggested in this chapter which could be further quantified.

1. In Section 4.7.3, we noted that there are several options for when an
data. There is much room for understanding the best strategy for re
networks.
2. The routing scheme described is oblivious, that is, no information about do
stream network traffic is used when selecting among available pat
selected could be improved if routers had some information abo
succeedingin oaiintan giao ai through a p articular b ackw ard port.Fin 
simple and cheap way to get timely congestion information bac
remains an interesting problem. Further, utilizing that informati
destroy the fault-tolerant properties of the protocol or adversely a
routing decisions can be made also remains a difficult problem.

3. In Chapter 3 we presented aggregate data for the performance of t
multipath networks. It would be worthwhile to quantify the relat
each of the key features. Particularly, it would be desirable to quan

(a) The benefit gained by using dilated routers in a network rather t
sized non-dilated routers in an extra-stage network with comp

(b) The benefit offered by fast path collapsing for various networks

(c) The benefit offered by masking faults as opposed to leaving them

(d) The impact of various pipelining strategies on network perfor

(e) How well this heuristic routing strategy stacks up against a c
does have global knowledge of all connection attempts in the
4. The strategy presented here is based entirely on circuit switching,
to-end benefits and timely retries for low-latency data transmissio
nature of the protocol. However, when the network becomes lar
of pipeline stages required to get from source to destination, com
typical data-stream, it will become less efficient. In such scenario
longer to hold the connection for the reply than to actually transm
thatin such a situation packet sw itch imgew; t> m lodrts ta fffieei et h fcl ynbt y r ( 
allowing the router resources to be used by other connections w
pipeline to flush and refill in the reverse direction. It would thus be
switching scheme which offers comparable fault tolerance and
benefits as this protocol while allowing more efficient use of rout
5. Testing and Reconfiguration¹



In this chapter we consider how to deal with failures in the netwo
we noted that the network and routing protocol allow us to continu
any knowledge about why connections fail. However, we also noted
benefit associated with masking the effects of faults. Further, it is use
the network. In this chapter, we develop reliable mechanisms for i
and discuss their utility.

5.1 Dealing with Faults 

The primary question to address is: "What do we do about faults in t
the previous chapter, we could do nothing. As long as the network con
all communicating processors via some path, correct operation wil
do nothing, there are a number of important things which we do not

1. How many faults have occurred in the network.

2. If the complete connectivity requirement has been violated

3. How close the accumulated faults come to violating the connectivity requirements

4. Which components are faulty so that they can be repaired

We established in Section 4k? .n tohwa h nfa aislksiidg)es have a perform ance be 
introduced in Section 2.1.1 som e com pmct idnegsmt © cb eel s sm laay a dll (frw m the 
network. In such cases it is important to determine when isolatio
guarantee that once isolation has occurred the isolated processor(s
uncontrolled way.

Even though we can operate oblivious of the detailed nature of faul
have reason to be concerned about where faults have occurred and
on system performance. We want the ability to:

1. Identify faulty components and interconnect

2. Reconfigure the network to mask known faults

Identifying faulty components and interconnect is useful in several
necessary for the system to reconfigure itself to avoid faulty compo
determine which units need physical replacement when the system
Having an estimate of the faults in the network is also necessary to ide



¹Portions of this discussion were first presented as [DeH92]
how likely isolation will occur when new faults arise. We have alread
allows improved performance in the presence of faults. Reconfigurat
for facilitating on-line repair.

In order for this identification and reconfiguration to work reliab
large-scale multiprocessor, it must function:

1. Without human intervention

2. Without taking the entire machine out of service

In norm al operation, w e expectthe system com pre saiai tg rfauurl m u 1 1 i p r ( 
some time after it occurs and reconfigure around it. We expect this to
continuouslyw hile the system rlit inpsr.cFc e fcs Dges-swc in He hi m a y h a ve fr e qu e n 
o c c u r r e aigc eMTTF of6m inutes as in Section 2.5), it is i n e ffic ientand oftei 
to bring the entire machine out of normal operation in order to pe
reconfiguration. Further, with networks of this size and potential faul
manage the reconfiguration reliably, much less economically.

Finally, we expect our fault identification and reconfiguration to be
are not robust against faults, they may well be useless when actually
liability to system integrity.

5.2 Scan-Based Testing and Reconfiguration 

As introduced in Section 2.2, the IEEE test-access port and boundary-scan architecture
[Com90] is emerging as an industry standard for component and boar
overhead, the standard allows functional testing of components and s
Additionally, the standard allows component specific registers whi
component configuration.

However, the IEEE standard TAP architecture has drawbacks which m
in large-scale, fault-tolerant systems. The singular and serial nature
critical, single point of failure in the test system. Architects are force
serial scan chains or to use many short scan chains. The former all
affect a large number of components while the latter requires signific
many scan paths. Furthermore, the standard TAP architecture integra
small portions of the system into diagnostic mode while leaving the
normal operation. As noted in the previous section, it is inefficient t
from normal operation for fault diagnosis.

5.3 Robust and Fine- Grained Scan Techniques 

In this section, we present three simple additions to standard sca
techniques to be utilized effectively in a fault-tolerant setting. The basic

1. Multi-TAP scan architecture: each component is given multiple test-access ports allowing
the component to be accessed from any of several scan paths.
Figure 5.1: Mesh of Gridded Scan Paths

2. Port-by-port selection: each port on a component can be independently dis
Section 4.9.1).

3. Partial-external scan: a port can be scanned in boundary-test mode indepe
operation of other ports on the same component.

With these additions, a scan-based diagnostic and reconfiguration sch
robust, minimally intrusive fault localization and repair.

5.3.1 Multi-TAP 

Supporting multiple test-access ports on a single component is a simple extension of t
redundant resource and inter cl oiip A a; atet sitot- e a sp. oWitpshp am eun m's scan 
c a p a b i 1 i t i eascccaens foesd throug^tizpri e/psferm ialscan paths. Tip bsnael hot w s t h e c < 
to be tested and reconfigured even when there are faults along some o
multiple TAPs on a component, scan paths can be arranged so that a mini
com ponents are severed from thatsp hen st(E as it spya ttii nfa b ^tm . vFo r instance 
can arrange the scan paths in a system with dual-TAP components su
are on the same pair of scan paths. This guarantees that two faulty sc
one component inaccessible. Figure 5.1 shows a topology which has this property.
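
One way to obtain such an arrangement is a grid in which each dual-TAP component sits on one row scan path and one column scan path. The Python sketch below assumes that layout and checks, by enumerating every pair of faulty scan paths, how many components could lose all scan access. The function names and path labels are hypothetical.

```python
from itertools import combinations


def grid_scan_paths(rows, cols):
    """Map each component (r, c) to its pair of scan paths: one row, one column."""
    return {(r, c): (f"row{r}", f"col{c}") for r in range(rows) for c in range(cols)}


def inaccessible(assignment, faulty_paths):
    """Components whose scan paths are all faulty."""
    return [comp for comp, paths in assignment.items()
            if all(p in faulty_paths for p in paths)]


assignment = grid_scan_paths(3, 3)
all_paths = sorted({p for pair in assignment.values() for p in pair})

# No two components share the same pair of scan paths.
unique_pairs = len(set(assignment.values())) == len(assignment)

# Worst case over every choice of two faulty scan paths.
worst = max(len(inaccessible(assignment, set(pair)))
            for pair in combinations(all_paths, 2))
print(unique_pairs, worst)
```

Under this assumed layout, two faulty scan paths can cut off at most the single component at their intersection, and only when one is a row path and the other a column path.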

When adding redundant scan access to a component, there are several issues which
addressed to assure us that we can realize the potential benefits of h
address the issue of resource contention between the scan paths,
cannot both perform a boundary scan through the same componen
alw ays have the ability tpoocnoennttr'e taacioniaths from a non-faultypath.Thi 
must be able to minimize or eliminate any potential for interference
can achieve these goals using two simple techniques:

1. Resource conflict resolution in favor of the most recent requester

2. Sparse encoding of scan instructions

Conflict Resolution 

Presumably, access to the scan paths is being coordinated at som
everything is working properly, there should never be a resource co
However, we are concerned with assuring that reasonable behavior
the system are not behaving properly. We give each TAP its own instruc
register. These registers behave exactly as in a standard TAP [Com90]. Differ
arise when multiple TAPs attempt to use the same scan registers. This would occ
the different TAPs attempted to load instructions that referenced the s
simple conflict resolution scheme is to give the TAP loading an instruc
the path. When the new instruction is loaded, the instruction in any c
bypass instruction. Since each TAP has its own bypass register, there w
to the bypass register. Assuming we can sufficiently minimize the chan
can successfully load a non-bypass instruction into its instruction re
fault-tolerance criterion. The scheme allows a non-faulty scan path
resources away from a faulty scan path. Figure 5.2 shows a possible ar
with two test-access ports.
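
The conflict-resolution rule can be sketched in Python as a small state model. It assumes that each TAP holds one instruction, that every non-bypass instruction names the shared scan register it uses, and that loading an instruction forces any other TAP holding the same register back to bypass. The class `MultiTapComponent`, the resource labels, and the `CONFIG` instruction name are hypothetical; `EXTEST` is used only as a familiar boundary-scan instruction name.

```python
BYPASS = ("BYPASS", None)  # each TAP has its own bypass register, so no conflict


class MultiTapComponent:
    """Hypothetical model of a component with several test-access ports."""

    def __init__(self, num_taps):
        # One instruction register per TAP, each starting in bypass.
        self.instruction = [BYPASS] * num_taps

    def load(self, tap, name, resource):
        """Load an instruction; the most recent requester wins any conflict."""
        for other, (_, held) in enumerate(self.instruction):
            if other != tap and resource is not None and held == resource:
                self.instruction[other] = BYPASS
        self.instruction[tap] = (name, resource)

    def owner(self, resource):
        """Return the TAP currently holding a shared scan register, if any."""
        for tap, (_, held) in enumerate(self.instruction):
            if held == resource:
                return tap
        return None


comp = MultiTapComponent(num_taps=2)
comp.load(0, "EXTEST", "boundary")
comp.load(1, "EXTEST", "boundary")   # TAP 1 takes the boundary register from TAP 0
comp.load(0, "CONFIG", "config")     # a different register does not disturb TAP 1
print(comp.instruction, comp.owner("boundary"))
```

Because the latest load always wins, a working scan path can reclaim a register from a path that is stuck holding it, which is the property the scheme relies on.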

Sparse Scan Instruction Encoding 

The IEEE TAP protocol for loading instructions is sufficiently involved
scan path from successfully loading an instruction in most cases. How
guarantee thatfaultybehaviorw ill not inter fe r ep w a tdin rt .o Si rrfa pliyacce 
faults,such as stuck -at fa mckf) sxd m ttoAlS^; l(icn ceks^villpreventapathfrom be 
able to load an instruction. Astuck -at fa ultin the da 1M, lrbfj))e s ordata-p; 
will force the downstream component TAPs to see all zeros or ones, m
the data lines to cause instructions with all zeros or ones to be loade
not the only kind of fault our system must contend with. Sparse instr
way to make the likelihood that a faulty path can load a valid instruc

The basic idea in sparse encoding is to make the number of legal en
to the number of possible encodings. The non-legal instruction enco
instructions so that they cannot interfere with the normal operati
correcting and detecting codes in common use for data storage and
are common examples of sparse encodings. In this application, we
errors and preventing them from corrupting non-faulty operation, n(

Each scan path has its own instruction register, bypass register, a
Each of these is identical to their counterparts in a single-!
The primary difference in the scan architecture as a result of the
the instruction decode and conflict resolution unit which repl
decode unit in a single-TAP component.

Figure 5.2: Scan Architecture fop Din ae In TAP Co m

example, we used a simple instruction enco riiimig cdiheocrh seuwi hd <nh c o
a nra-b it data word, the space of possi b^Tfe" 1 ! nvshreurce taisfihw epdt&a so2f legal
instruction cf o Idl w se i s s2s u m e that the clock and mode bits behave in e
manner to load in an instruction, but that the data lines hold rando
code word getting loaded are:

$$P_{\text{random-load}} = \frac{\text{number of legal codes}}{\text{number of possible codes}}$$



Of course, when choosing a checksum, one should make sure that t
words are not legal, checksummed instruction encodings.
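
As a rough illustration of the ratio above, the following sketch computes the chance that a uniformly random instruction word is a legal code, and checks that the all-zeros and all-ones words (the words a stuck-at data line would produce) are not legal. The function names, the 16-bit word width, and the example code values are assumptions chosen for the example, not values from the original design.

```python
from fractions import Fraction


def random_load_probability(num_legal_codes, word_bits):
    """Chance that a uniformly random word is a legal instruction."""
    possible = 2 ** word_bits
    if not 0 <= num_legal_codes <= possible:
        raise ValueError("legal codes must fit in the encoding space")
    return Fraction(num_legal_codes, possible)


def excludes_stuck_at_words(legal_codes, word_bits):
    """True if neither the all-zeros nor the all-ones word is legal."""
    all_ones = (1 << word_bits) - 1
    return 0 not in legal_codes and all_ones not in legal_codes


# Example: 8 legal instructions in a 16-bit instruction word.
print(random_load_probability(8, 16))                 # 1/8192
print(excludes_stuck_at_words({0x3A5C, 0xC5A3}, 16))  # True
```

Widening the instruction word while keeping the number of legal instructions fixed halves this probability for every added bit.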

McHugh and Whetsel propose adding pari (]ylT80]itnos t d ic <n ttiirfync e m -c o d i n g
rupted instruction words. Sparse encoding is a more general encodin
protection against data corruption.

Costs 

Reviewing the dual-TAP example shown in Fiigui oen fa2, w oessts a 6h q ttiha tf eadd i
with a multi-TAP component are:

• Four additional i/o pins per additional TAP

• One additional instruction and bypass register per additional TAP

• On e addition Mta>puetrp aidtd itio n al TAP

• One conflict resolution unit

• Ad d itio n a rMtiJXieps ufd reach shared register path

For most modern components, the lim ite d2r7elf) »aut hceer iiM iia^HD pins (Se e Se
area. As such, the additional i/o pins will generally be the first order c
TAP controller. Note that the size of the conflict resolution unit is prop
the number of potentially shared resources and the number of TAPs.

Compatibility 

As noted above, in the fault-free case, if both scan paths through a co
access the sja mnee en ct me gi s t bta ,-TAPec npnanent will behave identically to a st
sin gl e -TAP com p o ht e -rTAP (Mp nanents pladteioamadcbdirden on the software
assure that the scan paths through a given component never attempt t
In the faulty case, as long as there is a non-faulty path through a comp
can be used as a standard TAP as long as the faulty path does not mai
instruction. A standard single-TAP component may beiita eTAPi n a system
components, but the single-TAP component is susceptible to any faul
control lines.

5.3.2 Port-by-Port Selection 

Section 4.9.1 introduced the idea of port deselection for fault-masking,
disabling a port in this manner imply that the component will ignori
in which the port is disabled. This means the component will not
the disabled port, and the component will always choose to avoid tl
service. From the scan path, port selection/deselectipnrisesnitm ply acc
configuration register.

5.3.3 Partial-External Scan

Once we have a way to selectively remove some ports on a compon(
it makes sense to be able to perfornpaosmtaasntiangont ebayepho cr o tana s i s . Th
capability gives us finer granularity control over the scan paths allow
on subsets of the system while the rest of the system remains in oper

To support partial-external scan, the corht ipoonnad n it atero flt * ibonh aari xt le d;
at selecting the appropriate subset of the noilho BMkJJSecs ui n dt h ey-s c a n p a
boundary-scan padksAS alrlybt e bypass the portions of the normal bound
not being scanned during a particular partial-scan operation.

5.4 Fault Identification 

While describing MRP in Section 4.7.3, we noted that connections could fail when the pr<
appeared to be violated, when the destination rejected the incoming
tion was blocked. Any of these cases could be indicative of faults in th
even in the absence of faults and is the only case which does not nect
sort has occurred. Nonetheless, some faults may manifest themselve
the router checksums, if checked, provide another indication of wh
occurred.

The fault identification provided by MRP will give us an indication of wh
have occurred and may provide enough clues to narrow the search sp
However, MRP's fault identification generally does not tell us exactly what
blocking may be due to faults or due to high congestion. Corrupted
infrequent, transient faults or due to static faults. It is often useful t
it makes sense to reconfigure to mask out static faults but it may nc
infrequent, transient faults. Since the checksum data travels back al
always possible that any corruption is inserted by a router between t
source of the data. A router closer to the network input could corrup
of a router further away. Similarly, any router could be responsible ft
from the destination. For these reasons, we need a separate mechan
of each suspect element in the network.

In the most naive case, we could move the entire system into test r
boundary and intdrirt nelss toa tie fsatct h e i n t eogmityeocfta ®enr jiind every compone
In this manner, all structural faults in the interconnection can be
component faults matchin g2ol.u)rc m no Id e Id (Set elheimed. Real faulty wires
components can be differentiated from transient faults and congeste
false fault theories.

However, if the system is large, the impact of removing the entire syste
for testing can be significant. The larger the system, the higher the rate (
and the larger the amount of hardware that must be removed from
sufficiently large systems, it is often neither economical nor practical
from service.

With the additional support fl.jb, wcer icbaend m na $eec t hoe nte stin gsign ific an 1 1 ;
intrusive. The addition of port-by-port selection and partial-external sc
of scan testing. At a given time, we can isolate a minimal subset of th
faulty and perform functional and scan testing. By disabling the ports o
to a physical set of wires and performing scan tests on just those por
can quickly determine the integrity of the interconnect in question. Sin
on components connected to a given component, we can isolate the
from the network to perform functional testing on that single compo
the system may continue normal operation while testing occurs.

This scheme provides a capability for fault-identification and local
intrusive. The information gained from this scan testing provides det
nature and extent of suspected faults. With this information, the sy:
position to diagnose the extent of faults, perform reconfiguration to i
risks associated with continued operation.

5.5 Reconfiguration 

5.5.1 Fault Masking 

When faulty components or interconnections are identified, the fau
figuring the system to avoid the faulty component. Again, the scan-based
interface to this reconfiguration. Th pane hi ttyst ai£lagabFaf>of> tmi ntroduce
Sections 4.9.1 and 5.3.2, provides one effective means of fault avoidance. If
leaving every port on every component connected to the faulty compo
remove the unit from the functional portion of the system so that it c
operation. Similarly, if faults occur in the wire so, d nie/cit iocnrcrB a e inve r § o J
disabling the port on all affected components will effectively excise th
system.

This mechanism of disabling individual ports works effectively for r
the same reasons it was necessary for fine-grained diagnosis. Our mul
to function correctly as long as there is at least one enabled, non-faul
nodes in the network which need to communicate. The semantics o
component will ignore the port throughout the time in which the po

5.5.2 Propagating Reconfiguration 

We should note that reconfiguration, both when excising faults and
for testing, is best performed accounting for the network structure,
we end up removing all of the backward ports in the same logical d
router be cdecsitenck faranyconnections which are routed through it desl
direction which has no enabled backward ports. We may wish to e
connections from the network, as well. That is, we use the same bas
in Figure 3.28 to d etermdi efawu 11 tijcrh) uters should be excised along with th
to maintain good connectivity in the network. If we do not reconfigur
impact falls entirely on performance. As long as pathucs did Isexist betw
the routing protocol will continue to find the paths through the netw
reconfiguration, it is possible for a router to select a backward port \
cannot route the data stream any further towards its destination. Pro
this manner, prevents a router from ever routing a data stream into si
also worthwhile to note that the impact of not reconfiguring in this n
path reclamation (Se cu tp (p kd r4t&2J i sFisgu re 5. 3 shows an example where it
sense to propagate the reconfiguration and mask additional routing c
We can also note that it is not always possible to propagate back and
in the conservative manner su gge s t e d ai bio Heaw ri bhdoeustfr sacrrl atth ne gnaedtdv o r
When the reconfiguration algorithm showdiu icne Bi gunr e3ei28i w na s3.5.fi, tilt
was noted that this reconfiguration was intended for the harvest cas
configure some nodes out of the network. In the interest of providin
remaining nodes, the algorithm may end up isolating some nodes fo:
network. If we do not wish to isolate nodes from the network, we mi
does not leave some nodes disconnected from the intectwaolrk before
routers suggested by the reconfiguration propagation. Figure 5.4 shows
still exi st betwnd epno arilsein the network and the a 13gS8rwt bmlcbhow n in
suggest masking components in a manner which isolates nodes from
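
The dead-end idea in this section can be sketched as a fixed-point computation: a router that has lost every backward port in some logical direction is masked as well, and the check is repeated until nothing changes. This is a simplified illustration and not the algorithm of Figure 3.28; the function name, the dictionary layout, and the router and node labels are all assumptions made for the example. It deliberately omits the node-isolation check discussed above.

```python
def propagate_masking(backward_links, faulty):
    """Return the set of routers to mask, including dead-end routers.

    backward_links: router -> {logical direction -> list of downstream
                    routers or endpoints reachable via backward ports}
    faulty:         routers already known to be faulty
    """
    masked = set(faulty)
    changed = True
    while changed:
        changed = False
        for router, by_direction in backward_links.items():
            if router in masked:
                continue
            # A router is a dead end if some logical direction has no
            # backward port leading to an unmasked component.
            if any(all(d in masked for d in downstream)
                   for downstream in by_direction.values()):
                masked.add(router)
                changed = True
    return masked


# Hypothetical fragment: first-stage router "r1" reaches direction 0 only
# through "s0" and "s1", and both of those routers are faulty.
links = {
    "r0": {0: ["s0", "s2"], 1: ["s3", "s4"]},
    "r1": {0: ["s0", "s1"], 1: ["s3", "s4"]},
    "s0": {0: ["n0"], 1: ["n1"]},
    "s1": {0: ["n0"], 1: ["n1"]},
    "s2": {0: ["n0"], 1: ["n1"]},
    "s3": {0: ["n2"], 1: ["n3"]},
    "s4": {0: ["n2"], 1: ["n3"]},
}
print(sorted(propagate_masking(links, {"s0", "s1"})))
```

A practical version would also test, before committing each mask, whether any endpoint would be left without a path into the network.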

5.5.3 Internal Router Sparing 

If a routing component provides sparing within itself, the scan m
reconfigure the unit to swap spares within the router. For some i/o 1
crossbar routers, there may be plenty of additional room for functio
size is dictated by the pin-limited i/o. In these cases, it may make se
structures on the component. Faults inside some part of the routing c
by reconfiguring the component to use an alternate portion of the rou

5.6 On-Line Repair 

The combination of accurate ufapul fctdl ovciatiMz^ttaooapbecr form reconfiguration,
allows us to realize systems where the fault-repair loop can
mechanical intervention, at least up to the fault level provided by thi
grams monitoring the system integrity are empowered to test theorie
the system to best mask the effects of failures. Further, with a knowle
ments necessary for complete system operation along with an accu
the machine, programs can assess overall system integrity.

When outside intervention is necessary to repair the system, the s
disabling and port-based scan allow for in-operatiorp loenpel ntsem ent. If
into a physically replaceable subsystem are disabled, it is possible to
without any further interruption of system operation. Of course, the ele
of the system must also be suita bdg. fEamld \eenr erjfoln eStanp ecnotnfi puter syster
[And85], Stratus fault-tolerant computer systems [Web90], Thinking Mac
Once replaced, scan testing canodieiteeerrrioinea tihde fm m tee trico nal integrity o
rep laceplocrD erm t . Wh e na tcheemr ent is properly installed and identified as
disabled ports into the replaced subsystem can be re-enabled allow
full service.

Shown above is a configuration of a multipath network where it may ma]
out additional routers (The network without 8i22;).o\¥i tf%u ration is shown
the loss of the two routers in the second stage of the network, router 3
routers has no outputs in one logical direction. As a result, all connec
connect to destinations 8 through 16 which get routed through this route

Shown above is the same network with first stage router 3 also configured
By propagating the reconfiguration, it is possible to avoid the dead-end c
appeared in the previous configuration.

Figure 5.3: Propagating Reconfiguration Example

Shown above is another reconfiguration of the network first shown in
network has similar dead-end path problem to the one shown in Figu
cannot reconfigure this network to avoid the dead-end paths without i
the network.

Shown above is the result of propagating reconfiguration to avoid having
As suggested, this propagation results in the isolation of node 6 from the

Figure 5.4: Propagating Reconfiguration Example

5.7 High-Level Fault and Repair Management 

For several reasons, it is desirable to coordinate testing and reco
Particularly, some central coordination is required for:

• Network Integrity Assessment

• Reconfiguration Planning

• Coordinating scan paths

• Collecting fault data from multiple sources

With integrity assessment, we want to determine how safe it is to c
we want to know the likelihood that we will sustain a fault in the ne
machine inoperational or require serious reconfiguration. In a yield b
we simply want to know the likelihood we will retain complete con
a harvest fault-tolerance environment, we may want to know the like
in a period of time. If we combine the knowledge of the network to
known faults, and a model of fault-occurrences, we can make these

Integrity assessments can be used for several purposes. In the simp
allows human operators to assess the danger level associated with
case, it could be used to schedule down-time for physical repair. Th
feedback into the run-time system and tune operation accordingly. F
of complete network failure increases, a system may want to check
frequently to minimize the impact of the failure. In a harvest situation
certain danger level at which it begins to evacuate the computation a i
node. If a node can be evacuated before it becomes isolated, the cost
node situation can be avoided. If a danger level can be chosen such
evacuated before isolation, we may be able to get away with simpler
isolation. Simpler strategies generally require less hardware support
operation.

Actual reconfiguration based on the results of integrity assessmen
level coordination. As noted, a central notion of faults is necessary to
reconfiguration as described in Section 5.5.

As noted in Sectioht i5TAip mean paths will require central control to a
resource conflicts. High-level central control is necessary to utilize th
reconfiguration. For large systems, we would like to avoid a single-poi
having a singlepntiifcybele for all scan paths. The fact thca;tsw "ri ii ao/ea ne d u n c
paths to any component in the network, allows us to distribute the c
tolerate the failure of scan controllers in the same way we tolerate tl
we allow the scan path controls to be distributed, their action will r
well.

When deciding what faults are worth pursuing, it can be useful to
mult hpol des. For instance, when pursuing theories about a faulty netw
connect, it may be insightful to integrate fault data from several nodes \
or interconnect in question.

From a fault tolerance perspective, we want to distribute the high-
The easiest way to manage this is to choose some subset of the proct
perhaps all of them, and assign thesldimcgaegii t-heevtf a £ & s d ifnc g>annt d or e reconfiguration.
The control of scan paths could be distributed evenly amon
on the application, we could either dedicate a set of nodes to perfoi
simply make this part of the work performed by every processor in th(
to interconnect these coordinators as well as connecting them to the

The coordinating nodes will want to maintain a replicated, distril
festations, system integrity, and current system configuration. Replica
of the coordinating nodes may fail or be isolated from the network. Tl
depend on the fault-tolerance requirements for each particular syste
identified, this information needs to be shared among the coordinati

5.8 Summary 

In this chapter, we examined the integration of test and reconfigur
tolerant networks. We began by reviewing the motivation for fault local
addition of muht pppl® rTAIpso rtdeselection, and partial-external scan to :
testing approaches, we developed robust mechanisms to support f
reconfiguration. We describe how these mechanisms allow scan-bas
to proceed in a minimally intrusive manner, allowing the portion o
or reconfigured to continue normal operation. We also summarize
facilitates detailed fault identification, system reconfiguration, integi
repair. Finally, we sketched how the control and data management ;
can be integrated with the multiprocessing system, itself.

5.9 Areas To Explore 

In Section 5.3.1, w cb ui n 4rtiH iet yjfb ach pament to e xilst 1 pine mcia n paths.
The multipath scan a b d>lp tpygirveus mist yhoe wire the scan paths in a way wh
the effects of any faults. That is, for a given number of scan paths, we ca
of components to scan paths which i s o lea a e h tfla a Ft i\e sres It. cO<6 aro p o snee, n t
the scan paths are useful to us only because they allow us to test and
underlying network. The redundancy structure of tha enceot w not r k s h o u
when selecting an assignment of components to scan paths. For ins
be wise to assign all the components in a single stage of the netwo
Consequently, we also want to optimize the assignment of componen
which minimizes the effects which scan-path faults may have on net'
we assume a compo naecnctews iiiibclhe ivs an sacta n paths is faulty, we want to r
effects of scan-path introduced faults on network connectivity. Consi
of our scan paths, we Mlnio g/p hhya itceaxb U«d cality when wiring scan paths
By keeping scan path connections physically local, we keep the cost<
down and keep the reliab hlmteycotf hh gh i nWiiiecn looking for good assign
of components to scan paths, we would also like to 4 bdpy.l o i t a 1 a r ge
Determining reliable and practical assignments of scan paths on to]
redundancy characteristics remains an interesting problem to stud;

6. Signalling Technology



In the previous chapters we focussed on the architectural and organ
robust, low-latency communications. In this chapter and the next v
physical aspects of high-performance network design. In Chapter 4 wh
protocol at the data link and network levels (Se MRFicga n d d4. fc),xws e pointe
on top of any number of physical transport layers. Here, we address
practical and eco hlormg dcia t sp gnjap ctcts such routing protocols with m
transit latency.

6.1 Signalling Problem 

Our primary goal in selecting a signalling disciplinpcisi eon tsansm it c
with minimum latency while affording high B^annddwt ti ds tdi mTh e ttrea m ts i t la
the component inpu t/ p w ipludt d p teermdc f, n the design of our signalling di
well as the basic technologies available for implementation. Since w
computer networks with inter-router delays which grow as the syst
with signalling over poterotriaildy kolhngd n .t eAs a consequence, our interc
medium will behave like a transmission line, and our signalling pro
line design problem.

6.2 Transmission Line Review 

In this section, we review the salient features of tWD84| farission line ;
a thorough treatment of transmission lines.

For most physical interconnection media in use in digital systems
ation and phase distortion can be ignored. If we further ignore the ba
there are two primary characteristics which des cimpbdarKse transmissi<
and propagation velocity.

The propagationf,v£lica(ra tcy,t erizes the speed at which electrical wa
interconnect. When the voltage across a transmission line changes a
propagates down the transm f i sTh ieo pi r Id pa giatt ibmrvatocity is determined
materials in the interconnect and is given by Equation 6.1.

$$v = \frac{1}{\sqrt{\mu \epsilon}} \qquad (6.1)$$

Form ostm p, fce^ji w 1 h e^ is the perm ittivil jcce ffCe ai yqpn tional printed-circu
boards (PCBs <t h= & r n§, w h eep &4an<djis the dielectric constankn f>fr\\e e s p a c e . \
w h ecri s the speed of light in a vacuum. Standard PCBs, tth cu ie t, th ai \ltf a propa
the speed of light.

The characteristic transmission line impedance arises from the d
pacitance between the signal conductor and its associated ground
of the interconnect is fairly constant, the distributed inductance an
as well. The distributed inductance and capacitance give rise to a ch
imped £$. W&i ,ile transient voltage waves are propagating down the segr
line^Qdefines the transient voltage to t xZ\) riss iae fiu tn ocit ira ei not frtai tdoc.onductoi
geometry and the dielectric materials involved. For most conventiona
connect ge o^as 50 HelsSQ. $Z_0 = 50\,\Omega$ is a conventional standard for PCB interc
and high-performance cabling.

As long as the interconnect presents a uniform characteristic im]
wave propagates along the int eyr Reoanl ri b cet ractovo Inoecci t yh o w ever, is not infi
long, so we must consider what happens to the signal when it reach
line. To understand what happens, let us consider the more general p
our propagating wave encounters an impedance discontinuity. When
encounters an impedance discontinuity, the discontinuity gives rise
of discontinuity, part of the voltage wave may propagate through the di
may reflect back to the source. In general, when we have an incident voltage wave, $V_i$, travelling
from a region of interconnect with impedance $Z_0$ to a region with impedance $Z_1$, the reflected and
transmitted voltages, $V_r$ and $V_t$, are governed by Equations 6.2 and 6.3.

$$V_r = V_i \, \frac{Z_1 - Z_0}{Z_1 + Z_0} \qquad (6.2)$$

$$V_t = V_i \, \frac{2 Z_1}{Z_1 + Z_0} \qquad (6.3)$$

We can see from these equations that, if we want our signals to be tran
points, we need to keep the characteristic impedance of the interco
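
To make the effect of an impedance discontinuity concrete, the sketch below evaluates the standard transmission-line relations for a wave passing from impedance `z_from` into impedance `z_to`: the reflected fraction is `(z_to - z_from) / (z_to + z_from)` and the transmitted voltage is the incident voltage plus the reflected voltage. The function and parameter names are assumptions chosen for the example, as is the 50 ohm line.

```python
import math


def reflection_coefficient(z_from, z_to):
    """Fraction of the incident voltage reflected at the discontinuity."""
    if math.isinf(z_to):
        return 1.0  # open circuit: the whole wave is reflected
    return (z_to - z_from) / (z_to + z_from)


def split_wave(v_incident, z_from, z_to):
    """Return (reflected, transmitted) voltages at the discontinuity."""
    gamma = reflection_coefficient(z_from, z_to)
    v_reflected = gamma * v_incident
    v_transmitted = (1.0 + gamma) * v_incident
    return v_reflected, v_transmitted


# A 1 V wave on a 50 ohm line meeting three different terminations.
print(split_wave(1.0, 50.0, 50.0))      # matched: (0.0, 1.0)
print(split_wave(1.0, 50.0, math.inf))  # open circuit: (1.0, 2.0)
print(split_wave(1.0, 50.0, 0.0))       # short circuit: (-1.0, 0.0)
```

Only the matched case passes the wave on without sending any energy back toward the source, which is why a uniform characteristic impedance matters.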

If we transmit a signal from a driver at one end go fpa p w> i ri e e t e nad r, et h e i ve r
signalling is point-to-point. Point-to-point signalling can be contrasted to h
signalling where there caerie^esrs vanadlieveral potential drivers. The po
signalling case i s usri vk epr Iset n nod . Sindleirtlgbsigneien routers is between a
forward and backward port pair, we will focus the remainder of ou
signalling.

In a point-to-point signalling situation, we generally engineer the w
acteristic impedance for its entire length. The endpoints of the tran
our primary concern. Figures 6.1 through 6.6 show an incident voltage w
reflection scenarios depending on the ratio of the termination resis
impedance. If the end of the transmissa o(n7 te l rj i ri n>e> l&o)q Fp gunr -a; 6r2), u i t e d (
the reflected volt tfega swt hsres,am e as the incident waven dTlp © irre t e i ve r at
thus sees a v¥/lifaalg^eo §/ ing the arrival of the voltage wave. When the trar
is s h o r t -c i n'a?.u (Ef e 4n (<< Zq), Figure 6.6), the reflected vo^tiasgehw a aem e
magnitude as the incident wave bu teocpep VEsriaeau po Mairci td\ eTlm t wave. Wh

[Plot axes: voltage versus position along transmission line]

Attimi^Othere is a voltage tran sW/ta <b tih fc os m uQtaae end of the transmission
line. Shown above is the voltage profile along a transmission line interc
fir s t tran s i Ld.i®a<d < -).

Figure 6.1: Initial Transmission Line Voltage Profile

[Plot axes: voltage (up to 2V) versus position along transmission line]

Shown above is the voltage profile along a transmission line interconne
wave shown in Figure 6.1 encoun Z&f$£ ^sOnZ§)p e tih <e ifacrueili fl of the
transmission line. The voltage profile shown is characteristic of the lin
transit time across t h'.e Sn<t i <c2&)n n e c t (

Figure 6.2: Transmission Line Voltage: Open Circuit Reflection

the termination resistance exactly matches te'te. e (t^^^n * r*?o),s s i o n line in
Figure 6.4), there is no reflected wave. If the termination resistance is si:
the transmission line impedance, the reflection will tend between tr
and 6.5). It is important to note that the reflected voltage wave will ret
the transmission line and encounter the same reflection scenario v
defined by the impedance seen at the source end of the transmissio
continue to propagate along the transmission line until the line read
defined by the e ri.ei pnDthess-teady state, the voltage level along the wire w
voltage which the wire would possess if the transmission line were
For high-speed signalling, we want to engineer the termination to a
tions. That is, we want the des ttitiea t©oth end proeicnttvbol 6a;ge as quickly as p<
and remain there. Two common methods for achi eHiim gtahries goal for j

[Plot axes: voltage versus position along transmission line]

Shown above is the voltage profile along a transmission line interconne
wave shown in Figure 6.1 encounters an hi grE^r^ i;m Zp) adance terminatioi
the far end of the transmission line. The voltage profile shown is charac
during the second transit time a/.e. r4 Ss! <h^4).i nterconnect (

Figure 6.3: Transmission IZk&rfi ^V^Ra |e c t i o n

[Plot axes: voltage (up to 2V) versus position along transmission line]

Shown above is the voltage profile along a transmission line interconne
wave shown in Figure 6.1 encounters a mat cJ| e e„d=i i?b)pi tdance terminatic
the far end of the transmission line. The voltage profile shown is charac
after the first transit time acr d.& s£ t>h-ne). interconnect (

Figure 6.4: Transmission Line Voltage: Matched Termination

series termination and parallel termination. Figure 6.7 shows a parallel termination arra
and voltage profiles at both ends of the transmission line when the d
parallel termination, we select the driver so that it can drive the trans
age and select the termination resistance matched to the transmissi
($Z_{term} = Z_0$). A voltage wave originates at the driver and takes one transmi
to arrive at the receiver. Once the receiver sees the voltage wave, the 1
voltage until the next transition occurs. Figure 6.8 shows a series term
is selected to drive the line voltage to one-half of the desired voltage tl
equal to the line ii5|„Pe e=dZa)nacned( the receiver is left $tVtr% Tt>cZ$)£ u i t e d (
Here the one-half voltage wave arrives at the destination and reflects
thus sees a full-swing transition when the one-half voltage wave arri

[Plot axes: voltage (up to 2V) versus position along transmission line]

Shown above is the voltage profile along a transmission line interconne
wave shown in Figure 6.1 encounters a lo wZ&^i^n-Zp)® (4 ahnec e termination (
far end of the transmission line. The voltage profile shown is characte
during the second transit time a/.e. r4 Ss! <h^4).i nterconnect (

Figure 6.5: Transmission IZk&rfi <V<^(iRa |e c t i o n

[Plot axes: voltage (up to 2V) versus position along transmission line]

Shown above is the voltage profile along a transmission line interconne
wave shown in Figure 6.1 encou nZ%^.^ s<s; s?(h) a>1rtfcerfairietr(d of the
transmission line. The voltage profile shown is characteristic of the lin
transit time across t h'.e Sn<t & <c2&)n n e c t (

Figure 6.6: Transmission Line Voltage: Short Circuit Reflection



time after the source drives the transmission line. The reflected wave
the transmission line one transit time later or one round-trip transi
original one-half voltage wave. When the reflected wave arrives at the s

Zdr

and no further reflections result.
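
The behaviour of the two termination styles can be checked with a small bounce-diagram calculation. The sketch below tracks the voltage seen at the receiver each time a wave arrives, using the standard reflection fraction `(z_end - z0) / (z_end + z0)` at both ends of the line. It is an idealized model under stated assumptions: a lossless line, purely resistive driver and termination, and illustrative function names and component values that are not taken from the original text.

```python
import math


def reflection_coefficient(z_end, z0):
    """Fraction of an incident wave reflected by an end impedance z_end."""
    if math.isinf(z_end):
        return 1.0
    return (z_end - z0) / (z_end + z0)


def receiver_voltage_history(v_supply, z_driver, z_term, z0, arrivals=4):
    """Receiver voltage after each successive wave arrival."""
    gamma_source = reflection_coefficient(z_driver, z0)
    gamma_dest = reflection_coefficient(z_term, z0)
    # Driver impedance and line impedance form a voltage divider.
    wave = v_supply * z0 / (z_driver + z0)
    voltage = 0.0
    history = []
    for _ in range(arrivals):
        voltage += wave * (1.0 + gamma_dest)  # incident plus reflected
        history.append(voltage)
        wave *= gamma_dest * gamma_source     # bounce off both ends
    return history


# Series termination: matched driver, open-circuit receiver.
print(receiver_voltage_history(1.0, 50.0, math.inf, 50.0))
# Parallel termination: low-impedance driver, matched receiver.
print(receiver_voltage_history(1.0, 0.0, 50.0, 50.0))
# Mismatched driver with an open receiver: the line rings.
print(receiver_voltage_history(1.0, 10.0, math.inf, 50.0))
```

In the first two cases the receiver reaches the full 1 V on the first arrival and stays there; in the mismatched case it overshoots and takes several round trips to settle.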



6.3 Issues in Transmission Line Signalling 

Now that we have reviewed the key features associated with transm
consider the issues associated with high-speed, point-to-point signa
line design. By using series or parallel termination, we can control th
transmission line so that the destination settles to the desired volta

$$T_t = \frac{L}{v} \qquad (6.4)$$

Shown at top is a parallel-terminated transmission line. Below the trar
voltage profiles seen by the source and destination ends of the transm
d rive r fo rc*s|/ctrQi n s itio n at the source.

Figure 6.7: Parallel Terminated Transmission Line



As shown in Equation 6.4, the transit time depends on t fi,e length of tl
and the rate of signa lfp Fcopna gEqu (a lii,o n 6.1 we know that the rate of pro
depended on the properties of the materials. For most conventiona
v m 5 . High-performance substrates with slightly higher propagations s {
and reliability currently limits their use to small, high-end designs.
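
As a worked illustration of how the propagation velocity and the transit time combine, the sketch below uses the vacuum identity $c = 1/\sqrt{\mu_0 \epsilon_0}$ to write the velocity as $c/\sqrt{\mu_r \epsilon_r}$ and then divides the line length by it. The relative permittivity of 4 and the 30 cm trace length are assumptions chosen only to give round numbers; the function names are likewise hypothetical.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second, in vacuum


def propagation_velocity(relative_permittivity, relative_permeability=1.0):
    """Wave velocity in a medium, from v = 1 / sqrt(mu * epsilon)."""
    return SPEED_OF_LIGHT / math.sqrt(
        relative_permittivity * relative_permeability)


def transit_time(length_m, relative_permittivity):
    """One-way transit time, T_t = L / v, in seconds."""
    return length_m / propagation_velocity(relative_permittivity)


velocity = propagation_velocity(4.0)
print(velocity / SPEED_OF_LIGHT)             # fraction of the speed of light
print(transit_time(0.30, 4.0) * 1e9, "ns")   # one-way delay for 30 cm
```

With these assumed values the wave travels at half the speed of light, so a 30 cm connection costs roughly two nanoseconds each way.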

A key issue to guaranteeing that the destination end of the transmi •
desired voltage level in a single transit time is proper termination. In
termination cases, we require a termination which is matched to the
impedance. Process variation in the manufacture of printed-circuit b
plicates the ease with which we can achieve matched termination
will only guarantee the impedance of the E£B%i gTi ghltKfcurmiB nwd Bt h i n a b o
can be specified but always at higher costs. Additionally, there is the
termination is fabricated. <ExI eirtnrae Is, issut ©Base a r-nr esistor packs can be u;
mination with moderately high accuracy. Ho w © vue n ,tt e d rm ipi (a ii i© m t fo r a
such as a routing component, can r eRpBi fTh ias st raarb si fea A e es a ro tro tchoes t f o r t h
termination components, for the PCB real-estate, and for the added co
space required for termination also translates into larger distances t
longer transit latency and lower reliability. External, fixed resistors als
to reverse the direction of signal transmission across our transmiss
reverse an open connection in our network.

Shown at top is a series-terminated transmission line. Below the tran
voltage profiles seen by the source and destination ends of the transm
driver fo r c -e> sT/a r(h n sitio n .

Figure 6.8: Serial Terminated Transmission Line

Another key concern when driving transmission line interconnec
drive the transmission line. The power supplied by the driver is dete
and resistance seen by the driver (Equation 6.5).

$$P_{drive} = \frac{V^2}{R} \qquad (6.5)$$

A parallel terminated transmission line will dissipate power as sho
the transm iss Fo n line to

$$P_{parallel\text{-}drive} = \frac{V^2}{Z_0} \qquad (6.6)$$

A series terminated transmission line will dissipate power given by
trip transit time following any transition. Once the reflection returns
dissipated in the steady-state condition.

$$P_{serial\text{-}drive} = \frac{V^2}{2 Z_0} \qquad (6.7)$$

Additionally, power is required to charge the capacitance associated
given by Equation 6.8, where $f$ is the frequency at which t hAa^d^rirvs rhssvvbtbth ges ,
swing driving into th (e^d^SVii"^ f haencda pacitance which must be charged o
to change the voltage on the driver.

$$P_{charge} = \frac{1}{2} C_{driver} \left(\Delta V_{driver}\right)^2 f \qquad (6.8)$$

6.4 Basic Signalling Strategy 

To meet the needs of point-to-point si gra acblerpgwbilh fa (bgfri sep ,ewe d an d
a series-terminated, low-voltage swing signalling scheme which us>
feedback to match termination and transmission line impedances.
discussion, we CfaQS iins te grated circuit technology.

Low-voltage swing signalling is dictated by the need to drive the resistive
load with acceptable power dissipation. We see in Equation 6.5 that reducing the voltage
swing saves power quadratically. In the designs which follow, we specify signalling
between zero and one volt. Limiting the voltage swings to one volt saves a factor of 25 in power
over traditional five-volt signalling ($P_{serial\text{-}drive} = 250$mW with five-volt signal swings and
$P_{serial\text{-}drive} = 10$mW with one-volt swings).
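The quadratic saving can be checked numerically. The following is a minimal sketch, assuming a 50 ohm line (an assumption chosen because it reproduces the 250mW and 10mW figures quoted above) and assuming the driver sees its own matched termination in series with the line while the wave is in flight. The function name is illustrative only.

```python
def series_drive_power(v_swing, z0):
    """Power (watts) drawn during the round-trip transit time of a series
    terminated line: the driver sees its termination (assumed equal to z0)
    in series with the line impedance z0."""
    return v_swing ** 2 / (2.0 * z0)


Z0 = 50.0  # ohms, assumed line impedance

p_five_volt = series_drive_power(5.0, Z0)  # traditional five-volt swing
p_one_volt = series_drive_power(1.0, Z0)   # one-volt swing

print(f"5 V swing: {p_five_volt * 1e3:.0f} mW")
print(f"1 V swing: {p_one_volt * 1e3:.0f} mW")
print(f"saving factor: {p_five_volt / p_one_volt:.0f}")
```

Because the power scales with the square of the swing, a five-fold reduction in voltage gives a twenty-five-fold reduction in drive power under these assumptions.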

To achieve one -vo It sign allin g,(wnee pfe w di ehcao ran e -vo ltpow er supply 
purpose ofsign ailing. Th is freep Ohnee imtd ifFbdnni anl e; eodrri ngto convertbetw 
logic supply voltage a M ri gfaoel tsai go he ve 1 . An ypow erconsum ed generating 
supplyis dissipated in the pow er supply, and notin the individual ICs 

Series term ination offers severaladvantages over parallel term ina 
n allin g. We can integrate the term ination im pedance into the driver.Ir 
w e needed to drive the transm ission line aopl p al ger ac ilia Th et ce ffb e ta i/gn alii 
resistance across the driverbetw een the supplyrails and the driven t 
com pared to the transm is s.2ooim Id m A eimt p eddr ii\re ctii ,e transm ission line 
close to the s iugp f> lllyi (iSeg & F6.g>). rim a:MOS im plem en tatio n ,th is m eans thattl 
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Show n here is CfAhOS tbr a b Bern ission line driver.Show n atrightis the basic ( 
Show n on the leftis a sim plified m odelofthe driver m akinge xp licit t r 
transistor, w hen enabled, can be m odeled as a resistorofsom e resista 
transistor's W/ Lratio and process param eters. 

Figure 6.9: CMOS Transmission Line Driver



oft he transistors im plem entingthe finW/lLd a: ivortna m satMiea Ifh a iasgstan cf 
sm all. As a consequence,the fin a 1 d r i ve r i s large and, there fo re, has c( 
p a c i t a fidfi'tjgrr This m eans itw illtake additionaltim e to scale the drive 
signalup large enough to drive the fmaldriver. It also P^a r ^ e ans that the 
(Equ a t i o n 6.8), w illbe large.In contrast, the series term inated driverca 
driver. The higher im pedance ofthe series term inated driver allow s 
s m a We' J; ratio and hen jfg^^ra nlildffss latencydrivingthe output. 

The series term inated con figu ration gives us the opportunityto use 
chip, series term ination to m atch the transm ission line im pedance. 
line im pedance and the conductance ofthe drive transistors to vary 
m onitoringthe stable line voltage duringthe roiitiid ktrriaonts' atncs ht litm e b 
the source end ofthe transm ission line and the arrivalofthe reflectio 
w hetherthe driverterm ination is high,low ,orm atched to the transi 
a properlyterm inated series transm ission line,w e expectthe vo 1 1 a ; 
ground and rlhien igifgip dyduringthe fir stround -trip tra ntat 1 fc Si m a clfit he vo 1 1 ; 
above the hal f w aypoint,the drive im pedance is too low .If the voltage 5 
w aypoint,the driverim pedance is too high.Bym onitoringthe voltage 
the drive im pedance appropriately, the integrated circuit can com p 
both the silicon processingand PCB m anufacture to m atch the term i 
i m p e d a n c £MCMge slcuit designers are fam iliarw ith the practice ofdes 
com pensate forthe w ide variations associated w ith silicon process 
technique takes the strategyone step further to com pensate for var 
externalenvironm ent. 
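The feedback decision described above amounts to comparing the initial source-end voltage against the half-way point. The sketch below models the driver and line as a simple voltage divider during the first round trip; the function names and the tolerance band are hypothetical, introduced only for illustration.

```python
def initial_pad_voltage(v_signal, z_drive, z0):
    """Source-end voltage during the first round trip: the drive impedance
    and the line impedance form a voltage divider."""
    return v_signal * z0 / (z_drive + z0)


def classify_termination(v_pad, v_signal, tolerance=0.02):
    """Compare the pad voltage with the half-way point.
    'tolerance' is a fraction of v_signal and is an assumed parameter."""
    half = v_signal / 2.0
    if v_pad > half + tolerance * v_signal:
        return "drive impedance too low"
    if v_pad < half - tolerance * v_signal:
        return "drive impedance too high"
    return "matched"


for z_drive in (25.0, 50.0, 100.0):
    v = initial_pad_voltage(1.0, z_drive, 50.0)
    print(z_drive, round(v, 3), classify_termination(v, 1.0))
```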
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Functionally, w e w antan adjustable resistance forthe path to both the h 
rails. To drive the outputto a particm ianrescigrt hi 41 arpgpriari p, MEte rail to 
the outputpad via the tuned im pedance. 

Figure 6.10: Functional View of Controlled Output Impedance Driver

6.5 Driver 

To allow ad ju stable driverim pedance, the outputdriveris designed 
line on the a chip's outputp aidp pol yt hher e iuggha hi icnogrs tro liable im pedan 
Lo gi call y, this con figu ration is show n in Fi gu r e 6.10. Sd Verr g t hoep t i o n s e xi 
variable output im pedance. Kn i gh t a n d Kr ym m s u gge st controllingt 
controllingthe gate voltage on the fin[KK8&]t |6gee oFiigupr b t6dl r)i \Brrasn son 
s u gge sts u singe xponentiallysized pastlgm tgep pblad w aeie <4 ttBn e a iighpau t p a d 
[Br a 90]. The im pedance is controlled byonlyallow ingthe appropriai 
turn on to achieve the desired im pedance. Gabara and Kn a u e r s u gge s 
equivalent usinga set ofexponentiallysized atcr h nosfi 6h gk rp li 41 -p ;pr a hie 1 i : 
pull-dow n netw orks to allow digitalcontrolofthe outputim pedanc 
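To illustrate how a set of exponentially sized parallel devices gives digital control over the output impedance, the sketch below treats each enabled transistor as a conductance proportional to its size. The unit conductance value and the function name are assumptions for illustration, not values from the cited designs.

```python
import math


def output_impedance(code, unit_conductance, bits=5):
    """Impedance of a parallel network of exponentially sized devices.
    Bit k of 'code' enables a device with conductance unit_conductance * 2**k;
    conductances of enabled parallel devices add."""
    if not 0 <= code < (1 << bits):
        raise ValueError("code out of range for the given number of bits")
    g = sum(unit_conductance * (1 << k) for k in range(bits) if code & (1 << k))
    return math.inf if g == 0 else 1.0 / g


UNIT_G = 0.001  # siemens, hypothetical conductance of the smallest device

for code in (0x00, 0x01, 0x10, 0x1F):
    print(hex(code), output_impedance(code, UNIT_G))
```

With this model the total conductance is simply proportional to the binary code, so the impedance steps are fine at low impedance and coarse at high impedance.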

De Ho n , Kn ight,and Sim on consider a variant that places the im ped 
and the gatingtransistor in series betw een the ^DL^S9S]1 (Seuep plyand t h < 
Fi gu re 6.13). Th i s con figu ration achieves low er latencybym ovingthe im 
outofthe critical si gnal path through the outputpad. Un like the oth 
im pedance schem es,the im pedance settingis controlled separatel; 
static duringoperation.The generation ofthe pull-dow n and pull-up 
m ustbe perform ed in the signalpath to the fin a 1 s t a ge ofthe pad drivi 
im pedance controldevices do notchange w ith each data cycle,less 
the fin a 1 s t a ge ito^e r s , 

In allofthese driver schem es,w happhpih aghd s vgn a ltlhi n egss h old drop 
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Show n above is a voltage-controlle dMD£>dnrtirve tlfe d mm[IpK$&|.aB9' c e 
va r yi nVg, n tro/ below the logic supply voltage, one varies the gate voltage appl 
finaldriverw hen itis enabled. Mo dulatingthe gate voltage in thism anne 
conductance ofthe fin aldrive rand hence the im pedance seen bythe ti 



Figure 6.11: CMOS Driver with Voltage Controlled Output Impedance

more below the hi ghN-Hoegiicc s i$ pc pi hy,b e used to form the pull-up netw o 
the pull -down network. Thatis,when the internal logic can drive tl 
fmaldriverm ore than a threshold abo veutpi p ld/,a Is foreecdoehm ghsssiignyia 1 1 i n gs 
t o u s ©-da evice pull-up to allow the outputto sw ingallthe w ayup to t 
NMOS devices have si ze, speed, and pow er advantages. Sihoaithe m obil 
tw o and a halftim es the m N-db i Mtcyeo wh kt 11 eas ga\ie n transconductance,ar 
im pedance, can be roughlytw o and a halftim e-d es uicael We. i ttrti ahiea c o r r 
sam e transconductance.The sm allerdevices presentless capacitan 
internal logic and hence operate faster w hile dissipatingless pow e 
driverlayoutbecom es sm allerand sim pier since tJadee Mcadsd. river is b 
Ou tputdrivers that R-& feyvocne tsNadlnhdvi ces require guard-ri ©Ivfps la e tlw e e n the 
NMOS devices to protectagainstlatch-up. 

Figure 6.14 show s a sized version ofthe output driver show n in Figi 
i m p 1 e m e ffiM©Scl6,i He w lett Pack airedffs c0t8 ve gate-length process [DKS93]. Th i s 
output driverexhibits a 2ns output latencyand Ins rise/falltim es. 1 
capable ofm atchingexternali mQp e dl aSBObln st <b teathy tfeasndJQver consum es 
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Show n above is a digitallycontrolled variable resistance driver from 
a c t u all y RM6)Se devices in both resistance networks since itfocuses on o 
a d i ffe r e n t vo 1 1 a ge r e gi o n ). Th e fliLgvtip&dawaelwL re $dampedance 

determ ine w hich transistors are enabled w heneverthe driverdrives ; 
respectively. The transconductances ofthe enabled paralleltransistors 
the transconductancelU fent wr a elna tihde tshisgnoau t p u t p a d . 

Figure 6.12: CMOS Driver with Digitally Controlled Output Impedance



approximately 10mW + 2mW/100MHz of power.



6.6 Receiver 

The receiverm ustconvertth enlp> uvt -svbght a gtf cs w fnnlgis w inglogic signalfo 
inside the com ponent.In the interestofhigh c-sepi eeerdws \v i ctfc rh iansghwi gh w i 
gain forsm allsi gn aide viations around t h d lim gdp-p (biienst. tjfiXJSw e e n the si 
and [KK88] bndt nee suitable d effeirve n st i aR6.g^ e <h o w s oneism icvfa rr Th e 
r i gh t m o s t i n ve r t e r p a i r (II a n d 12) i n Fi gu r e 6ek5 foi Fanr sb aadsiefS it©ntriia}lw hen 
the input voltage seen byctfeasdiabatfplad leoxwp -pd y.t M gensd 12 are identical 
inverters. The ienapcuht sat® taken through resistors to w hoait nvdo u 1 d n orit 
connections ofthe inverters.The resistor betw een the pad and 12 is t 
resistor w hich mCMOStieixi aittopm ds. The resistor b e tW iaiegivotlht agbalfsigna 
leveland His an identicalresistorforreference.The norm alinputa 
structure are shorted togetherso thatllserves as a bias generator phi 
ofoperation.Ifthe inputpad voltage connectedltn £2iw 1 1 ar ge al bsw la, t h a 1 f I 
the tw o devices w ould be in identical voltage states and the outputof 
also be m id-range.As the pad input voltage seen byI2varies aw ayfrom 
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Show n above is a digital controlled -im pedance driverafter [DKS93]. Th e i 
o npuJmpedance a n fldJmpedance enable the parallel im pedance control transis 
Drivertransistors are placed in series betw een the im pedance control 
putpad.The desired s i gn a> hi hnegcvbeldatgB ils c p a d byenablingthe appropric 
drive transistor. The digitalim pedance controls rem ain static duringn 

Figure 6.13: CMOS Driver with Separate Impedance and Logic Controls

Hand I2rapidlybecom e unbalanced leadingto a high -ga in output fr or 
is slightlyabove the halfvoltage 1 e ve 1 , t h e sw itchingthreshold of 12 bee 
supplied b y II. Th e bias on the gates ofI2devices appears like a low in 
drives a high output. Si m ilarly, ifthe pad input to 12 is slightlybelow th 
sw itchingthreshold ofI2becom es low erthan the II bias causingthe '. 
In response, 12 drives a low output. Fi nail y, 13 serves to restore the reco 
to a fullrail-to-railvoltage swingfordrivinginternallogic.In order for 
should be sized so thatits m idpointvoltage tracks the m idpointvolt 
va r i a t i o n . 

Fi gu r e 6.16 s h o w s a ve r sei o aa i wefrt hhaw n 6ih5 #i guiiceh w as im plem ented 
i nCMOS26, Hew lett Pack airedffs c0t8 ve ga t e -1 e n gt h p r o c e s sn [p)KSt93]a tThnec y 
through this receiver is appro xim ately^l, jfes r Th e st net a & i /Ve rl a tiedn tchyp d r i ve r 
described in the previous section is 3ns. 
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All transistors are 0. 8u m I o n g . 

Show n here kSMffiSsdiEC vfe r circuit from [DKS93]. Allw idths are show n in m icr< 
This driverw as d eMQSgM>,<Hd wo Ire 1 1 Pa c k ^ a: dffdsc Q.8ve gate-length process. 

Figure 6.14: Controlled Impedance Driver Implementation

6.7 Bidirectional Operation 

The drivers and receivershow n in the previous secdiitDpilse ot,a n be con 
bidirectionalsi gn ailing, as neecpaadi forr ttshce ersocur i bnegcb onmCh ap ter 4. Asing 
pad w ould contain both a driverand a receiver.Atanypointin tim e c 
line w ould be con figu red to drive the line and the other to receive, 
both its pull-dow n and pull-up enables turned off. In this m ode,boi 
inpieteieiver look like h i gb d m epcetdcam s; eoct h e transm ission line. The re 
behaves as the high-im pedance connection w e expecton the destina 
transm ission line.The drivingend ofthe transm ission line drives eit 
enable connectin guap spi gji taol ltihi egiransm ission line through the adjust 
n e t w o r k . Wh en it is necessaryto turn the connection around to rev 
netw ork,the i/o pads can sw ap roles as driverand receiver. 
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Shownaboveisar ^KKSSJvdrr -aeft 4 ilras nffia re identica FKLffel v± c e s ( 

WP2, WN\ = WN2). 71 b i a sl&i n t o its h i gh -ga in r e gi o n . Wh en the voltage on the 

inputpad is sligh tlyh igh e r o r lo W ienrgt\h) d tia tgb er dill k fan grfekfie s 

the vo 1 1 digs tandardizes th £2 (four tip siet <bnf s ide the com ponent. It should be 

sized to have the sam e vo UlaagB M. idpointas 

Figure 6.15: CMOS Low-voltage Differential Receiver Circuitry

6.8 Automatic Impedance Control 

The drivers described in Section 6.5 all allow ed the output i m p ed i 
section w e turn our attention to the task ofautom a tic a llym atchingt 
attached transm ission line im pedance.In anysuitable schem e,w e 
w hether the term ination im pedance is high or low and a m echan 
inform ation back to upd attta rt lg.eStiarrr tp te ^\a ri tche tfheee d e tiwsirsdaenscfc iri b e d 
in this chapter, w e can obtain the inform ation necessaryand close 
a discrete -tim e sam pie register and allow titri tggandsssatm phi e vanl ip as si a l 
through the test -access port (TAP) (Se c t i o n 2.2 and Ch a p t e r 5). 

6.8.1 Circuitry 

Figure 6. 17 show s the scan architecture fora bit doi netcottb e a iasnghaarldp,a d 
b o u n d a r y-s can eraecghs peard ,has an im pedance controlregister and a sa 
im pedance register holds the digitalim pedance setting for the pull- 
in im pedance controls schem es such as the ones show n in Figures ( 
registercan but nwd tei it tsecna n controlthrough the TAP t o c o n figu re the p u li- 
ne tw ork im pedances.The sam pleregisteris shtd) own no icnc M gt$ toe n6tlB. eWh e i 
logic value to be driven outofthe pad, an enable pulse is fed into the 
ripples through the in verter chain enablingeach nspa imt pal leu ree Igws be r t o s t 
inverterdelays apart. The digitalinputvalue to the sa mcpe ks/erie. gi s t e r c o i 
All transistors are 0.8μm long.



Show n here is the differen P3>KS98]2 (Abli we r dfit hmare show n in m icrons. 
This driverw as d eMQSgM>,<Hd wo Ire 1 1 Pa c k ^ a: dffdsc Q.8ve gate-length process. 
Grounded-Pinverters are u s e <4 d b it\he ff d ai fib e er it h iaa rl ic (EMias plem entary 
i n ve r t e r s as in Fi gu r e 6.15. Exp erim ental evaluation su gge sts that the geo 
transistors used for the di flfeor a hdt i b b rlea crgdreirnsorder to provide good 
stabilityto processing variation. 

Figure 6.16: CMOS Low-voltage, Differential Receiver Implementation

Thisreceiverm aybe the sam e one used forreceivingsignpaikst.w hen the 
The keyrequirem entisetbat \tehregd m p m d: tre s one logic value w hen the pa 
above the s i gn p pi 1 jirg si d -p o i n t vo 1 1 a ge and the opposite logic value w he 
below .Follow inga transition ofthe outputlogic value,the sam pie re j 
ofcloselyspaced tim e sam pies ofthe digitalvalue seen bythe receive! 
read under scan controlthrough the TAP to provide a digital, discrete -t 
the outputpad.Figure 6.19show s the com bined i/o pad circuitry from 
provides access to read the sam pie registerand w rite the im pedanc 
o f a n a 1 yzi ngthe data recorded bythe sam pie register and selectingth 
o ff-c hip controller. 

6.8.2 Impedance Selection Problem 

Thegoalisto settheoutputim pedanceto achievem atched series te 
line w ere ideal, the signalshad no appreciable rise -tim e,and theroui 
line w as definitelylongerthan oursam pie register,the im pedance sel 
voltage at the pad ofa m atched transm ission line w ould look like Fi 
high transition. If the series resistance w as a little low erthan optim a 
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Figure 6. 17: Bi directional Pa d Scan Architecture 
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Asim pie sam pie registeris com posed ofa sequence oflatches each ei 
delays apart. Wh en a transition occurs on the output p ad ,a short ena 
into the sam pie register.Each sam pie latch records the value seen byt 
w as last en ab led . Aft erthe enable pulse propagates through the sam p h 
registerholds a discrete -tim e sam pie ofthe value seen bythe receiver. 

Figure 6.18: Sample Register

w ould settle a little above the m id -pnopiui tt (yea lit a gn gitrh cb fcrai mtphlesiregister 
read allones.Sim ilarly, ifthe series resistance is a little higher,the lin 
trip pointand the sam pie registerw ould read allzeros.To setthe pull 
through various im pedaerac&it teiaatgiw gs c At n figu re the outputim pedance 
force a transition ofthe output usingthe scan capabilities. Fo How ii 
the value ofthe sam pie register,again usingthe scan TAP. Wh en we fin d 
w here the sam pie register changes from readingallzeros to reading 
the appropriate pull-up im pedance setting. The sam e basic operatio 
transitions to con figu re the pull-dow n im pedance. 
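For the idealized case just described, the scan reduces to stepping through impedance settings until the sample register flips from all zeros to all ones. The sketch below simulates that procedure; `read_sample` stands in for configuring the impedance, forcing a transition, and reading the register through the TAP, and the ideal line model is an assumption made for illustration.

```python
def find_ideal_match(settings, read_sample):
    """Step through impedances (ordered highest to lowest) and return the
    adjacent pair between which the register flips from all zeros to all ones."""
    previous = None
    for setting in settings:
        sample = read_sample(setting)
        if previous is not None and set(previous[1]) == {"0"} and set(sample) == {"1"}:
            return previous[0], setting
        previous = (setting, sample)
    return None


def ideal_line(z0, bits=16):
    """Hypothetical ideal line: the register reads all ones when the drive
    impedance is below z0 and all zeros otherwise."""
    def read_sample(z_drive):
        return ("1" if z_drive < z0 else "0") * bits
    return read_sample


pair = find_ideal_match([80, 70, 60, 55, 45, 40], ideal_line(50))
print(pair)
```

The matched impedance lies between the two returned settings; a real line does not behave this cleanly, which is why a more robust selection strategy is needed.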
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Abidirectionalpad w ill c o n ta in both driverand receiver circuitry. Sho 
bidirectionalpad con figu ration integratingthe d Eicvc li deert ailed in Figure 6 
detailed in Figure 6.16. 

Figure 6.19: Driver and Receiver Configuration for Bidirectional Pad
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Show n above is the voltage atthe source -end ofan ideal, m atched serie 
m ission line follow inga low to high transition from the driver. 

Figure 6.20: Ideal Source Transition

Un fortunately, there are m anynon-idealeffectsw hichtccamsnotbe ign< 
look m ore like the ones show n in Figure 6.21. Si nee there is a finite rist 
real sign al.th e driverw illalw ays require som e tim e to drive the trans n 
point. The sam pleregisterw illnotcontain allonesorallzeros. Du e t o 
CMOS process, the tim e betw een subsequentsam pies in the sam pie r e £ 
a sam pie bitrelative to the o u t p u t t r a n s i tp gd m mnat jt iz>a <r y> wn ipl ® hyefmot mThcia l 
variation can easilycause the inter -sam pie tim e to varybyas m uch as 
ittakes finite tim e forthe si gn alto getfrom the inputpad to the sam p 
sam pie w ere taken w hen the output started changing, severalsam p] 
inputto the sam pie registerreflects the voltage on the outputpad.Pro 
this skew betw een the pad volta gee i Facto skitt it© nvaarnydr Ohme com ponenttt 
com ponent. 

As a result, the sam pie values returned from an im pedance scan 
Table 6.1. As the im pedance decreases, the line does trip from low tc 
w here the low to high trip occurs becom es earlier as the source in 
com m ensurate w ith ourexpectations ofa finite rise -tim e. Eve n t u ally, 
This is an indication ofthe num ber ofsam pie tim es w hich elapse b 
sam pie register and the arrival ofthe fastentppistd Thr anestateiOmutrnrbffii rgh t 
ofbittim es this re qu ires w illvaryfrom com ponent to com ponent d 
settingthe controlim pedance requ Krge sTa" bad ty. lit atm & a dt h m st idfjattha e( b e s t 
im pedance setting for proper series term ination. 

6.8.3 Impedance Selection Algorithm 

Aheuristic strategyw hich w orks w ellin practice forselectinga m i 
atthe derivative ofthe seagm Tja bel ip l5d)ni aid center in on the center ofthe 
derivative region. Th at is, we look atwhecacthes arm rpslbt iaonnd ot<a (kic rlshien 
deltas betw een im pedance pairs. The search focuses on findingthe 1 
the largestdeltas betw een an adjacentpairofim pedance values.For 
trip -point, the sam pie registerw ould nevertrip and above,itw ould i 
Impedance Setting    Sampled Data
0                    0000000000000000
1                    0000000000000000
2                    0000000000000000
3                    0000000000000000
4                    0000000000000000
5                    0000000000000000
6                    0000000000011111
7                    0000000011111111
8                    0000001111111111
9                    0000001111111111
a                    0000011111111111
b                    0000011111111111
c                    0000011111111111
d                    0000011111111111
e                    0000011111111111
f                    0000011111111111
10                   0000011111111111
11                   0000011111111111
12                   0000011111111111
13                   0000011111111111
14                   0000011111111111
15                   0000011111111111
16                   0000011111111111
17                   0000111111111111
18                   0000111111111111
19                   0000111111111111
1a                   0000111111111111
1b                   0000111111111111
1c                   0000111111111111
1d                   0000111111111111
1e                   0000111111111111
1f                   0000111111111111

Shown above is sample data for a 16-bit sample register. The impedance setting corresponds
to the binary encoding of the enables for 5 exponentially sized impedance transistors (0
implies all the transistors are disabled, while 1f indicates that all the transistors are enabled,
the lowest impedance setting).

Table 6.1: Representative Sample Register Data
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Show n above are a series ofm ore realistic depictions ofthe voltage w 
the source end ofa series term inated transm ission line. At the t o p , w 
impedance situation. The middle diagram shows a case where the 
im pedance is too large,and the bottom diagram s show s a case w here 
is too sm all. 

Figure 6.21: More Realistic Source Transitions

the change occurs would clearlybe the point where the largestdel 
im pedance setting. 
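The derivative view can be made concrete with the data in Table 6.1. The sketch below counts the ones in each sample (a proxy for how early the register trips) and takes the difference between adjacent impedance settings. Using the count of ones as the measure of the transition point is an assumption of this sketch.

```python
def row(zeros, ones):
    """Build one 16-bit sample string from its run of zeros and run of ones."""
    return "0" * zeros + "1" * ones


# Table 6.1, indexed by impedance setting 0x00 .. 0x1f.
TABLE_6_1 = (
    [row(16, 0)] * 6      # settings 0-5: register never trips
    + [row(11, 5)]        # setting 6
    + [row(8, 8)]         # setting 7
    + [row(6, 10)] * 2    # settings 8-9
    + [row(5, 11)] * 13   # settings a-16
    + [row(4, 12)] * 9    # settings 17-1f
)

counts = [sample.count("1") for sample in TABLE_6_1]
deltas = [counts[i + 1] - counts[i] for i in range(len(counts) - 1)]

# Index i of the largest delta means the change lies between settings i and i+1.
largest = max(range(len(deltas)), key=lambda i: deltas[i])
print("counts:", counts)
print("deltas:", deltas)
print("largest change between settings", hex(largest), "and", hex(largest + 1))
```

For this data the deltas are concentrated around settings 5 through 8, with small isolated changes elsewhere.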

Naively, w e could scan throu gha Icoeonkt iirmg p e dlyaant caed p airs. We could id 
the pairofim pedance values betw een w hich the largestdifference o 
im pedance settings to con figu re the im pedance netw ork. Ho w e ve r , s 
strategycan be m isled.Itisoften the case thatthe difference betw een 
perhaps one ortw o bitpositions.Thatm akes itdi ffic ultto decide w h 
- e.g. considerthe case in w hich there is a run of five im pedance setti 
position, fo How ed by five im pedance settings all identical, then a difl 
and no subsequentdifferences.Apair oriented algorithm w ould sele 
change is reallyin the m iddle ofthe five im pedances w hich differbyo 

Instead, we use an algorithm which recursively looks at impedance intervals in an

Set Impedance:
    old_himped ← middle impedance value
    old_limped ← 0
    limped ← middle impedance value
    himped ←
    while the old and current impedances differ
        old_himped ← himped
        himped ← find_high_imped(limped)
        old_limped ← limped
        limped ← find_low_imped(himped)
    set_imped(limped, himped)

Figure 6.22: Impedance Selection Algorithm (Outer Loop)




attem ptto zero in on the largestdelta.To avoid m issinglarge gaps w hi( 
the recursive search divides the region into pieces and recurses on 
space w ith the largestgap.Further,since the va 1 tut En (g fitch e sobpapve site i m 
a second -ordereffecton the otpitri gi, va le i ht epr a dlea hhcreo siegh searching for th 
and low -im pedance settings untilthe 6s2(2 Huettia)iri ss tehoenbvaisgec. figra ret h m 
for converging on a pair of impedance values. Figure 6.23 describes the algorithm used to zero
in on an impedance value using the heuristic strategy just described.
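The narrowing step can be sketched in a few lines. This is not the original implementation: it indexes the search by impedance setting code, uses the count of ones in each sample as the "transition" measure, and compares the two overlapping two-thirds regions of the current interval. All three choices are assumptions made for illustration, and the data is that of Table 6.1.

```python
def find_impedance(counts, lo, hi):
    """Narrow the interval [lo, hi] of setting indices toward the region
    where the sample data changes most, one third at a time."""
    while hi - lo > 2:
        third = (hi - lo) // 3
        hmid = hi - third   # upper end of the lower two-thirds region
        lmid = lo + third   # lower end of the upper two-thirds region
        change_low = abs(counts[hmid] - counts[lo])
        change_high = abs(counts[hi] - counts[lmid])
        if change_low > change_high:
            hi = hmid       # larger change is in the lower region
        else:
            lo = lmid       # larger change is in the upper region
    return lo, hi


# Ones counted per setting, taken from Table 6.1 (settings 0x00 .. 0x1f).
counts = [0] * 6 + [5, 8, 10, 10] + [11] * 13 + [12] * 9

# Start from the last setting with no transition (5) and the first setting
# that already matches the final sample value (0x17).
print(find_impedance(counts, 5, 0x17))
```

Because the two regions overlap, a large change sitting near the middle of the interval is never cut away by a single step.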

6.8.4 Register Sizes 

Wh en adaptingthe im pedance control strategy described here to s 
process, it is im portantto consider the am ountofgranularityavaila 
and the tim e window covered bythe sam pie register. The num bero 
and hence the num berofbitsused bythe im pedance controlregiste 
im pedances to w hich the pad needs to m atch,the potential proces 
m ism atch w hich is considered negligible.The num berofbitsin the s 
the size ofthe w indow required toi giioarr aaitfeecthrdtfctihaetta - alnpsr ocess corn 
one could d oatpgriDipe ixa e 1 1 y w hen to sam pie the output, onlya single bit 
w ould be necessary. Ho w ever, since processingand operatingtem p 
tim ingofthe inputand outputcircuitry, the num berofbits in the sam 
such thatthe w indow spans allpotentialtim ingvariations. 

6.8.5 Sample Results 

The test component [DKS93] was configured with a 16-bit sample register and
six bits of both pull-up and pull-down impedance control. Figure 6.24 shows the voltage profile seen
at both ends of a 50Ω transmission line for an impedance selection determined using

Find Impedance:
    hp ← highest impedance value with no transition
    lp ← lowest impedance setting with same value
         in the sample register as the highest impedance setting
    while ((hp - lp) > 2)
        hmid ← hp - (hp - lp)/3
        lmid ← lp + (hp - lp)/3
        if transition difference between hmid and lp is greater than
           that between lmid and hp
            hp ← hmid
        else
            lp ← lmid
    return

This version of the algorithm divides the remaining distance into thirds and recurses on
overlapping two-thirds regions. A larger fraction could be used for
greater selection accuracy, at the expense of slower convergence.

Figure 6.23: Impedance Selection Algorithm (Inner Loop)

the algorithm described above.Atthe process corner represented b 
sixbits ofim pedance control w ere m ore thanQstur ffin iserm t stsoi oc 4 e a n 1 y 
1 i n e . Fi gu r e 6.25 show s the m atchingaclpioe n/asdi foarntdh ksustaorm ea tioc rim p e d a 
selection algorithm w hen few ercontrolbits are used. Of course, a dif 
curve w ould provide a differentim pedance resolution and range.Figu 
diagram s w hen the sam e pad is aut oQct raatri <s ml l^mi a tic hi n A .t o a 100 

6.8.6 Sharing 

On e option w hich m aym ake sense in m anysituations is to share 
perhaps im pedance control, betw een severalpads.Ifa group ofpads 
p h ys i c a 1 ns.^e d a an( e PCB), the externaltransm ission lines theydrive w illh 
sam e characteristic im pedance. If the pads are physicallyclose on tr 
process variation from pad to pad. In such cases, i t m akes sense to s 
im pedance controlw ithin the group ofpads. In cases w here w e canr 
aboutthe externalim p et did b e ep, ©tsw icbuddosshare a single sam pie regis 
group ofphysicallylocalpads.Such a shared sam pie registeronlynee 
to select among the possible input sources.

Shown above is the voltage profile seen at both the driver and receiver ends of a
50Ω transmission line. The impedance matching shown here was determined
using the algorithm detailed in Figures 6.22 and 6.23. (Time axis: 2.00 ns per division.)

Figure 6.24: Impedance Matching: 6 Control Bits

6.8.7 Temperature Variation 

Tern perature is an im portantenvironm entalfactorw hich affects t 
circuit. Tem perature affects the trans (CMOS idnut e gaaitced ocfidc edict ea sn idn ha e n c t 
the term ination im pedance. The autom atic im pedance m atchingde 
to adjustthe term ination im pedance to be m atched atthe tem perat 

Theprocessdescribed abovew ould norm allybeperform ed aspart 
f o r e a c hp ooonm nt.In som e environm ents,itis possible forthe com pon 
w idelyduringoperation.As the tem perature varies from the pointw 
place, the term ination im pedance w ill deviate from the transm iss 
e ff e c t is s i gn i fie a n t e n o u gh to a ffe d lity#tih semr a> sus then rg p a foitaobc ol w ill noti 
higherthan norm alrate ofcorrupted m essages through the com pon 
system attem pts to localize errors (Se e Chapters 5), it can rerun the n 
adjustthe im pedance for the current operatingtem perature. Am or 
taken byintegratinga tem perature sensor onto the integrated circu 
on-chip tem perature sensor, the scan controller could periodically 
com ponent. Wh en a com ponent's tem perature indication differs sigr 
indication w hen the com ponentw as last ceaclalkirbartae tie, t h e sun apnecdoamtcr e 
setting. Us ingthe port -deselection and p gd idt t±> cyepdo int Ch aiip f se <r i9, i hi e siratnr 
controller can isolate individualportpairs and recalibrate their dr 
having a significant performance impact.

Shown above are the voltage profiles seen close to the driver and receiver ends of a
50Ω transmission line following both high and low output transitions. The impedance was
selected automatically. Since only 3 bits of control were used, the high-order bit was
used to simulate parameter variation: in the top pair of traces the bit was disabled, in
the bottom it was enabled. (Time axis: 2.00 ns per division.)

Figure 6.25: Impedance Matching: 3 Control Bits



6.9 Matched Delay 



In Se c t i o n 3.2 w e pointed outthatpipeliningbittransm issions over 
pre ventthetransittimes across wires in the system from havinganega 
bandw idth and latency With the circuitrydeve loped in the previous s 
how to reliablypipeline m ultiple bits o nptbu wnitrse. s betw een routing 
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2.00 ns per division 
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Show n above is the voltage profile seen atboth the driverand receivere 
100£2 transm ission line. The im pedance w as m atched autom atically. 1 
voltage levelis the resultofthe finite im pedance ofthe m easurm entap 

Figure 6.26: 100Ω Impedance Matching: 6 Control Bits

6.9.1 Problem 

The w ires interconnectingcom ponents in the netw ork varyin len 
tem perature variations.the exactdelaythrough an input or output p 
With arbitrarylength w ires and uncertain i/o delays, there is no guar 
arrives at the destination com ponentrelative to the system clock. If 
setup tim ejustbefore the clock rises orduringthe hold tim ejusta 
receiver can clock in indeterm inate data. To avoid this potentialpr 
delaythrough each outputpad to gua irtai d tie m rtrh laetst h e tdiigndael stlriannast i o n a 
reasonable tim e w ith respectto the clock. 

6.9.2 Adjustable Delay Pads 

To controlthe arrivaltim e ofsignals atthe destinations variable del 
the internallogic and the fin a 1 o u tp u t d rive r. Th is bufferis designed to h 
such that it can alw ays m ove the arrivaltim e ofthe signal out ofthe 
processingand tem perature. For the granularityofcontrol necessa 
variable delaybuffer could sim plybeaa; one e & ttiqo he sxe quperro o/ied drftgaps offof 
chain ofinverters (Se e Fi gu r e 6.27). For fin er control, ofcourse, a voltage c 
be used instead of, or in addition to, a variabl 6.2S)e Fa gum eu fl.£9p 1 e xo r (Se e 
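As a rough illustration of how a tapped inverter chain could be used, the sketch below picks the first tap whose added delay moves the signal's arrival out of the setup and hold window around the receiving clock edge. The timing model, parameter names, and the assumption that each tap adds one fixed, non-inverting delay step are all hypothetical.

```python
def choose_tap(arrival, period, setup, hold, stage_delay, n_taps):
    """Return the smallest tap index whose delayed arrival avoids the
    forbidden window, or None if no tap works.
    Times are measured relative to a rising clock edge at phase 0."""
    for tap in range(n_taps):
        phase = (arrival + tap * stage_delay) % period
        # Unsafe if within 'hold' after the edge or within 'setup' before the next edge.
        if hold <= phase <= period - setup:
            return tap
    return None


# Example: a signal arriving 0.5 time units before a clock edge (period 10).
print(choose_tap(arrival=9.5, period=10.0, setup=1.0, hold=1.0,
                 stage_delay=0.5, n_taps=8))
```

The delay range of the buffer must exceed the width of the forbidden window for a safe tap to be guaranteed, which matches the requirement stated above for the buffer's range.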
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Fi gu re 6.27: Map 1 e xo r Ba s e d Va r i a b 1 e De 1 a y Bu ffe r 



i n p u t [>° — t — P> — j — [>° — r — P>- 
v c t-H- 



g n -d 




output 



Show n here is a voltage controlled delayline (VCDL) a f t e r [Ba z85] and [Jo h 88 
VCTRL effectivelycontrols the a tmv© lionatcb £ e a p abcy tda a? cohuitrpME Ir oefr 
stage and hence the delaythrough each inverterstage.The num berofs 
the VCDL w illdepend on the range ofdelaysrequired from the buffer. 

Figure 6.28: Voltage Controlled Variable Delay Buffer

show s a revised pad architecture w hich incorporates the variable c 
con figu ringthe delaythrough the com ponen t's iI¥£PrT<h ier <p ua idt rdy r Evmr aaim d 
the sam e as in the m atched im pedance pads described above. 

6.9.3 Delay Adjustment 

We can use the sam e basic strategyused for m atchingim pedance 
is ,b yw atchingthe voltage levelatthe source end ofthe transm ission 
round -trip transit tim e across the transm ission line. Si nee w e can 
the sam e tim e to propagate from the source to the destination as it 
the destination to the source, we know that the signalarrived at the 
the round -trip transit tim e.Allw e need to do i si tii© tesr rfr ci me t\h 4i e n t h 
h a 1 f-w aypointto the fullsignalvoltage railas w ell a s w hen ittransitic 

The inform ation w hich w e record in the sam pie registerw hen sc 
im pedance va lues, in effect.alre ad ypro vid e s u s w ith thisinform ation 
r e gi s t e r . Th eeicnepiiyet ir is setto trip w heneverthe voltage on the source en 
line exceeds the hal f-w aypoirlkilng tw id em tmcersrigai la operation, w e sell 
driver im pedance to m atch the transm ission linecBontriathtihe sourc 
hal f-w aypointduringthe round trip -tim e.If, forexam ple,w e w ere to s 
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Show n above is the revised bidirectional pad architecture incorpor; 
buffers in the outputpath as w ell a s a scannable registerto controlthe 

Figure 6.29: Adjustable Delay Bidirectional Pad Scan Architecture

three times the characteristic impedance of the transmission line, t
midpoint and [...] when the line was driven, but when the refle
the far end of the transmission line. That is:

$$V_i = \left(\frac{Z_0}{4 Z_0}\right) V_{signal}$$

$$V_{Rdst} = V_i$$

$$V_{Rsrc} = \frac{1}{2} V_{Rdst}$$

For a transition [...], the pad voltage becomes

$$V = \frac{1}{4} V_{signal}$$

for the period [...]. After the reflection returns and reflects against the un
termination,

$$V = V_{Rsrc} + V_{Rdst} + V_i = \frac{5}{8} V_{signal}$$

for the period [...]. Qualitatively, the situation resembles the case w
termination is too large as shown in Figures [...] impedan
to be $Z_0$ as assumed, for this behavior to hold. As long as the driver impe

$$Z_0 < Z_{drive} < 4 Z_0$$
Figure 6.30: Sample Register with Selectable Clock Input

the sample register will trip after the [...]
is sufficiently long, we can determine not only the impedance setting,
and reflected waves occur.
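
The source-end behaviour discussed above can be checked numerically. The following is a minimal sketch, assuming an ideal lossless line of impedance `z0`, a purely resistive driver of impedance `z_drive`, and an open far end; the function names and the step-by-step bookkeeping are illustrative assumptions rather than part of the pad design.

```python
def source_voltage_steps(z_drive, z0=1.0, v_signal=1.0, round_trips=3):
    """Return the source-end voltage after 0, 1, 2, ... round trips.

    Assumes an ideal lossless line (z0), a resistive driver (z_drive),
    and an open, unterminated far end.
    """
    v_incident = v_signal * z0 / (z0 + z_drive)   # initial voltage-divider step
    gamma_src = (z_drive - z0) / (z_drive + z0)   # source reflection coefficient
    steps = [v_incident]
    wave = v_incident                             # wave currently travelling out
    for _ in range(round_trips):
        # The open far end reflects the wave unchanged. When it returns, the
        # source sees the returning wave plus its own re-reflection.
        steps.append(steps[-1] + wave * (1.0 + gamma_src))
        wave *= gamma_src
    return steps


def trips_on_first_return(z_drive, z0=1.0, v_signal=1.0):
    """True if the midpoint is crossed only when the first reflection returns."""
    steps = source_voltage_steps(z_drive, z0, v_signal, round_trips=1)
    half = v_signal / 2.0
    return steps[0] < half < steps[1]
```

With `z_drive = 3 * z0` the sketch gives a first step of one quarter of the signal swing and five eighths after the first round trip, so the midpoint is crossed only on the return. Under the same ideal assumptions the crossing stops happening on the first return once `z_drive` exceeds roughly 4.24 times `z0`.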

Notice that the delay through the [...] is the same when searching for the
to the midpoint as when searching for [...]
the receiver is [...] times to determine when thi
at the destination. Also note that this is a discretized time sample li
granularity.

We still have the problems that the delay between sample bits is pr
of the samples relative to actual signal transitions is uncertain. Thes
by allowing a version of the clock to be switched into the sample regi
the input [...] (See Figure 6.30). In this way, the inter-sample bit times can be
terms of fractions of the component's clock [...]
and the enable pulse on the sample registers are synchronized to thi
alignment of a signal transition at the far end of the transmission line
through the [...]. Adding a delay margin for variation in the receiver
margin necessary for clean signal reception, the variable [...] delay can be
always arrive at the far end of the transmission line at a well defined ti
other piece of information which we get from this configuration is th
requires for a bit to travel across a piece of interconnect. This informa
the routing component [...] the pipeline delays [...] tran
the associated interconnect (See Section 4.11.3).
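
One way the measured round-trip time could feed a pipeline-delay setting is sketched below. This is an illustrative calculation only: the function name, the assumption that the one-way transit time is half the measured round trip, and the rounding up to whole clock periods are all assumptions, not details of the routing component described in Section 4.11.3.

```python
import math


def wire_pipeline_cycles(round_trip_ns, clock_period_ns):
    """Clock cycles a bit spends in flight on a wire (hypothetical helper).

    Assumes the one-way transit time is half the measured round trip and
    that a partially used clock period counts as a whole cycle.
    """
    if round_trip_ns < 0 or clock_period_ns <= 0:
        raise ValueError("times must be non-negative and the period positive")
    one_way_ns = round_trip_ns / 2.0
    return max(1, math.ceil(one_way_ns / clock_period_ns))
```

For example, a 30 ns round trip with a 10 ns clock corresponds to a 15 ns one-way flight, which this sketch rounds up to two clock cycles.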

6.9.4 Simulating Long Sample Registers 

The discussion in the previous section assumed the sample regis
actually record samples until sometime [...]
0.8µ process, inverter delays run about [...] enabled rou
200 ps to 400 ps apart. Each nanosecond of wire, or about 15 cm of wire
sample bits in the sample register. Actually building a long sample reg



1 Actually, in an ideal setting the impedance [...], as high a






the range of wire lengths of interest would be impractical. However, w
register by delaying the sample pulse into a short sample register by a
if we can delay the pulse into the sample register by multiples of the s
slide the sample register forward in the time sequence. By performin
with varying offsets for the sample register pulse and recording the sam
transition, we can virtually reconstruct the waveform which a long sa

Figure 6.31 shows a simple sample register architecture for simulat
in this manner. The enable pulse ripples through the inverter chain a
pulse reaches the end of the inverter chain it is optionally recycled
pulse finishes cycling through the inverters the configured number o
will contain the values recorded during the last cycle. Care, of cours
the recycle path and in reconstructing the waveform. If the recycle pa
delay between the last sample bit in one cycle and the first sample bi
not be identical to the inter-sample bit delay for bits enabled during t
is small, it may only make the sample granularity slightly coarser. Figu
that recycles the sample bit before completing a cycle. As a result of th
all transition can be pin-pointed to inter-sample bit times. The wavef
the overlapping samples to more accurately mimic a single long sam
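
The idea of sliding a short sample register across repeated transitions can be sketched in software. This is a minimal model, assuming the line behaves identically on every repetition of the transition; `capture_window` and `reconstruct` are hypothetical names, and the list `waveform` stands in for what an impractically long sample register would have recorded.

```python
def capture_window(waveform, offset, width):
    """Model one short sample register: `width` samples starting at `offset`."""
    return waveform[offset:offset + width]


def reconstruct(waveform, width, step, total):
    """Rebuild `total` samples from repeated captures with shifted windows.

    Each loop iteration models one repetition of the transition with the
    enable pulse delayed by `offset` sample times. Choosing step == width
    gives back-to-back windows; step < width gives overlapped windows.
    """
    if step < 1 or step > width:
        raise ValueError("step must be between 1 and width to avoid gaps")
    rebuilt = [None] * total
    offset = 0
    while offset < total:
        for i, bit in enumerate(capture_window(waveform, offset, width)):
            if offset + i < total:
                rebuilt[offset + i] = bit
        offset += step
    return rebuilt
```

In this model the recycle-path mismatch mentioned in the text is ignored; a fuller model would give the first sample of each window a slightly different spacing from the rest.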

The maximum operating frequency of the counter and comparator w
of a sample register we can use in this scheme. The sample register
long to allow the comparator [...] for the next enabl
If we assume a delay of at least 100 ps through each inverter and we as
we need a sample register which is at least 25 inverter-pairs long for p
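
The sizing argument above can be expressed as a small calculation. This is a sketch under stated assumptions: the function name is hypothetical, and the 5 ns settling time used in the usage note is an assumption chosen because, with 100 ps inverters, it reproduces the 25 inverter-pair figure; the source's own assumption is not legible in this copy.

```python
import math


def min_inverter_pairs(logic_settle_ps, inverter_delay_ps):
    """Shortest register (in inverter pairs) whose ripple time covers the
    time the recycle logic needs before the next enable pulse.

    Assumes one sample bit per inverter pair and a fixed inverter delay.
    """
    if logic_settle_ps < 0 or inverter_delay_ps <= 0:
        raise ValueError("delays must be positive")
    pair_delay_ps = 2 * inverter_delay_ps
    return math.ceil(logic_settle_ps / pair_delay_ps)
```

For instance, `min_inverter_pairs(5000, 100)` evaluates to 25 under these assumptions.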

6.10 Summary 

In this chapter we addressed the issue of high speed signalling between components. We identified
the problem of transmitting bits between [...] as a point-to-point transm
line signalling problem. We saw that a low-voltage swing, series-termi
nalling scheme provided the low-latency signalling we desired while
low. In order to address the issue of termination matching, we intro
end of the series terminated transmission line. This allowed us to
impedance to the characteristic line impedance, compensating for
vironment. Finally, we showed that the basic matching mechanisms
the delay alignment necessary to reliably pipeline data across wires

6.11 Areas to Explore 

In this chapter, we have detailed a matching scheme that uses digit
matched impedance. It would also be possible [...]
the impedance selection was high or low. Such a scheme may requir
more amenable to automatic, on-chip calibration.

To simulate a larger sample register, we allow the enable pulse to be r
sample register inverter chain. Varying the number of cycles which the e
through the inverter chain, allows us to move the window of time reco
register. We can combine the samples recorded at each recycle configu
the waveform seen by the input pad over a large period of time.

Figure 6.31: Sample Register with Recycle Option

To gain higher accuracy when reconstructing a waveform from many
windows, we can recycle the enable before the end of sample registe
register windows [...]

Figure 6.32: Sample Register with Overlapped Recycle

The extensions necessary to allow delay matching have not, as yet
tested. No doubt, we stand to learn more about this problem from su

7. Packaging Technology



When we actually build any system, we must physically package its [...]
must provide a physical medium for the wires interconnecting com
a mechanical support substrate to organize and house the compon
packaging a network will directly affect the size of the packaged net
connection distances and transit latency between components. In
packaging technologies and develop hierarchical schemes for pack
in Chapter 3. The packaging scheme developed here makes use of all
minimize interconnection distances.

7.1 Packaging Requirements 

When packaging a network we have many, often conflicting, goals. We

• Minimize the interconnect distances (and hence [...])

• Provide controlled-impedance signal paths (Chapter 6)

• Supply adequate power to all components

• Facilitate [...] clock distribution to all components

• Cool components by removing the heat generated by ICs during ope

• Facilitate physical repair

• Minimize packaging cost

To keep the interconnect distances short, we seek dense packaging s
as close as possible. Excessive density, however, makes supplying p
physically repairing faults difficult. High-performance packaging and i
the most expensive part of a system to manufacture and assemble. A
strategy, we must keep in mind the economics of the available techn

7.2 Packing and Interconnect Technology Review 

7.2.1 Integrated Circuit Packaging 

Conventional packaging technology starts [...] as the b
level building block. Silicon ICs are diced from [...] wafer
Fine pitch wires are bonded from pads around the periphery of the die
shelves along the perimeter of the die cavity in the IC package. The pac
and bonding wires from the environment. Dicing and packaging allo
components [...]. The packaged IC can be more easily han
and assembly than the bare die. Packaged ICs can be replaced as defe

Today, most IC packages are plastic encapsulants, ceramic, or fine-l
Plastic packages are inexpensive, but only allow connections aroun
limited capacity for heat removal. Ceramic packages are much mo
hermetic sealing for the die cavity and allow greater heat transfer fr
package. High power IC packages provide paths of high thermal condu
package where a heat sink may be mounted to disperse the heat re
circuit board IC packages [...] PCB production and can
experience, tools, and manufacturing developed for fine-line PCBs. Cera
packages can support i/o connections [...] surf
pin-grid array (PGA) arrangement.

Whether the pins are arranged around the periphery of the IC for a pi
in a gridded fashion, the package size is generally determined by the c
size and achievable density of external i/o connections. As a result, a
than the housed die. The size of an IC package may only be correlated t
because both the package size and the die size are often directly dete
pins on the component.

7.2.2 Printed-Circuit Boards 

Packaged ICs are assembled on printed-circuit boards (PCBs). These b
support for the ICs and provide the first level of interconnection amo
Conventional printed-circuit board technology [...] manufactu
layers of etched copper provide interconnect in two-dimensional pl
are separated by layers of insulating materials. Drilled [...] plated hol
among the two-dimensional etched [...]
may be located on one or both sides of a composite PCB.

Some packaged ICs have pins which can be inserted in mating hol
via a socket. During assembly, solder is used to connect [...] the compone
Component packages with [...] through the PCB and are
termed through-hole components. Another common form of packaged ICs ha
which can be soldered to exposed metal lands on a surface of the P
which sit on the surface of the PCB and solder to PCB [...] without pro
called surface-mount components. [...] components only consume space on
of the PCB connected to the IC, whereas through-hole components con
the PCB include the [...] PCB surf

A common discipline when designing printed-circuit boards for di
dedicate a power plane for each power [...] the system
This discipline allows power distribution with minimal resistive lo
and the components. Dedicated power planes also provide low indu
supplies and the packaged ICs' power leads. Solid conductor plane
reduce the cross-talk between [...] consistent control
impedance interconnect, signal traces are generally [...]
of solid conducting planes (See [R...VB84])

As a practical matter, there is a limit to the system size we can plac
board. Despite the fact that PCBs are [...], PCB
technology is essentially two-dimensional. The size of a single printe
by mechanical stability, yield, and manufacturing constraints. Toda
maximum viable side length for PCB technology. In fact, for manufactu
manufacturers limit one PCB side dimension to less than 14 inches. For
yield considerations will generally provide more severe limits on the

Today, PCB features down to 8 mils (1000 mils = 1 inch) are considere
manufacturers can produce features down to 3 or 4 mils, but the over
manufacturing costs are roughly proportional to the number of laye
Below the feature sizes used in volume production, the cost increase
When dealing with [...], the cost is also dependent on the variety
holes required.

7.2.3 Multiple PCB Systems 

When a system design exceeds the size which can be efficiently pla
circuit board, the system must [...] connectors an
cables. Boards interconnected via a backplane PCB is, by far, the domin
interconnect, today. In this case, one PCB is used to interconnect man
on the "backplane" board allow other boards to mate orthogonally
produces a structure which takes some advantage of three-dimensio

When a system exceeds the size practical to build in backplane fa
physical, or mechanical requirements limit backplane use, portions
via cabling. Again, connectors on each printed-circuit board provide a
Rather than directly attaching two or more boards, a cable of insulate
connect the boards.

Cables for controlled-impedance interconnects come in three pri

1. Ribbon cables

2. Coaxial cables

3. Flexible printed-circuit cables

Ribbon cables are composed of a [...] insulating material.
Flat-ribbon cables can be used in a [...]
impedance interconnect. Coaxial cables place a conductor inside
cables have more stable impedance characteristics, but are often bu
the alternatives. Flexible printed-circuits use the well established F
laminates. Typically, flexible printed-circuit cables achieve controlle
over ground plane [...]

7.2.4 Connectors

To date, most board-to-board, board-to-cable, and board-to-packag
using pin-and-socket connectors. One board, cable, or IC has a connec
rows of pins. The mating unit has a connector which houses a set of
geometry. The two [...] by mating the pin and socket connection

More recently, a number of [...] connectors have become available. One c
connectors can mate directly with lands on two PCBs or packages. W
nector provides electrical interconnect between the lands. Compre
characteristics which make them preferable to pin-and-socket con

1. Higher density

2. Superior signal integrity

3. Lower insertion force

4. Function without solder

Pin-and-socket connectors are limited by the achievable density for d
compressional connectors are limited by the area required to carry si
of separate conductors. Removing the need for solder makes assem
insertion force required for inserting pin-and-socket connectors is p
pins on the component. As the number of i/o pins increase, so does th
pin PGAs are already experiencing excessive insertion [...]
to move to more complicated socketting schemes which mate pins
out the required insertion force. Traditional pin-and-socket connect
controlled-impedance paths, while many of the emerging compress
signal paths.

Several technologies currently available for compressional interco

• Anisotropic conductive elastomer

• "Button balls"

• Springs

Several manufacturers now produce strips or sheets of elastomer w
conductors are arranged to conduct only along one axis. In this way,
between [...] opposite sides of the connector and lined up along t
The elastomer will compress under pressure [...] conduct
cont[...] (e.g. [...] [Pol90] [Tec88] [ND90]). "Button balls" are [...] of 25
compressed into small diameter [...] in a plas
carrier. They provide multiple points of contact between [...] lan
when comp[...] (e.g. [...] [Smo85]). Spring [...] connectors hous[...]
which behave as springs in a flexible carrier. When compressed betw
of the metal forces positi[...]
[Cor90]).

Shown here is a cross-sectional view of a [...] stack. Componen
sandwiched in alternating layers to form a three-dimensional stack str
This stack structure serves as the next level of the packaging hierarchy
forms the basic building block out of which larger systems can be buil

Figure 7.1: Stack Structure for Three-dimensional Packaging

7.3 Stack Packaging Strategy 

Leveraging mostly conventional packaging technologies, we can pac
utilize all three spatial dimension. We continue to use fairly conventi
technology but stack [...] in the third dimension orthogonal to the PCB p
form stack structure sandwiching layers of packaged ICs between PCB lay
Compressional board-to-package connectors provide signal continu
printed-circuit boards and integrated-circuit components. This sta
packed three-dimensional cube of components and interconnect
building block for even larger networks and systems.

7.3.1 Dual-Sided Pad-Grid Arrays 

While there is novelty in our design and use of the IC package, the bas
we employ is conventional. The integrated-circuit is housed in a pa
of contacts. Rather than being pins, the contacts are land grids sim
for attaching surface-mount components. These land grids are connected th
connectors to similar land grids on the PCBs. Due to the low-insertio
available from these land-grid arrays (LGA), they [...] attractive optio
packaging high pin [...]. Intel, for example, [...] LGA package
for its 80386SL microprocessor [Mal91].

We make one final addition to the LGA structure. Rather than placin
the bottom side of the package, we place the land-grid array on both
optionally, provide continuity through the package between the top a
resulting [...] dual-sided pad-grid array (DSPGA) to emphasize the fact that pads are p
on both sides of the package. Each vertical pad pair can be configured

1. The corresponding pads on the top and the bottom of the package
through without connecting to an IC pin to support vertical interc

2. The corresponding pads can be connected together and to an IC p
make contact to traces on either or both of the boards above and

3. The corresponding pads can be connected to different IC pins to s

Not connecting the corresponding top and bottom pads as in (3) requi
ufacture and will make the package more expensive than if only co
used.

Figure 7.2 shows DSPGA372, a 372 pad DSPGA we have developed. DSPGA372 supports
160 IC signal connections, 76 through signals which do not connect to
supplies supported by 72, 40, and 24 pads, respectively. All lands are 30 mi
around 10 mil plated holes. Contacts [...]
three internal power planes for providing a low-resistance and low-i
and the external power supplies. The nominal ground plane is suppo
planes supply the logic [...]. Additionally,
space is provided in the package [...] capacitors across the powe
The 76 through pins allow the package to supply [...] interconnect
signals which do not connect to the IC. The remaining 160 pads suppo
Each of these 160 signals is available on both the top and the bottom o
Table 7.1 summarizes the physical dimensions of our DSPGA372 packag
pictures of the package. DSPGA372 was fabricated by Ibiden using BT (Bis
as a seven-layer printed-circuit board.
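
The pad counts quoted above can be tallied as a quick consistency check. This is a small sketch: the counts come from the text, while the dictionary layout and the generic `supply_a` to `supply_c` labels are assumptions, since the supply names are not legible in this copy.

```python
# Pad budget for DSPGA372 as quoted in the text. The supply labels are
# placeholders (assumed), only the counts 72, 40 and 24 are from the source.
DSPGA372_PADS = {
    "ic_signal": 160,
    "through_signal": 76,
    "supply_a": 72,
    "supply_b": 40,
    "supply_c": 24,
}


def total_pads(budget):
    """Total number of pads in a pad budget."""
    return sum(budget.values())


def supply_pads(budget):
    """Number of pads devoted to power supplies."""
    return sum(n for name, n in budget.items() if name.startswith("supply_"))
```

The categories sum to the 372 pads that give the package its name, with 136 of them devoted to power.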

The DSPGA372 package has dedicated cooling and alignment holes. The
be used to align the package to the compressional connector and a
The inner holes open into the heat sink cavity underneath the die all
flow across the heat-sink for heat removal.

7.3.2 Compressional Board-to-Package Connectors 

Packaged ICs are mated [...] above and below, through comp
connectors. These connectors provide through contact between th
PCB. Using self-aligning connectors, no solder is needed to make reli
Properly selected compressional connectors will provide consisten
as required for high-speed signalling.

Figure 7.4 shows a picture of BB372, a compressional [...] board c
mate with DSPGA372. BB372 [...] 372 buttons aligned with the DSPGA372 land grids on
The buttons used [...] cylinders. Figure 7.5 shows a close

DSPGA372 is a 372 pad dual-sided pad-grid [...] through
between the top and bottom of the package. 76 pads do not connect to
exist simply to provide through interconnect. (Artwork by Fred Drenckh

Figure 7.2: DSPGA372

picture of a button in [...] BB372. The buttons provide low-resistance
impedance interconnect. The [...] BB372 [...] to accommodate the
or lid associated with the mating DSPGA372. BB372 is 30 mils thick allow
or lid attached to DSPGA372 to extend at most 30 mils vertically above or
Complementary holes are provided for coolant flow to match those
has two stubs at opposite corners [...]. The stubs
protrude on both sides of the carrier, allowing the carrier to make
the attached PCB and DSPGA package. The BB372 carrier is made from Vec
Polymer [Cor89] and was fabricated by Cinch. Table 7.2 summarizes the p
Our main disappointment [...] BB372 has been the handling care required. The f

Table 7.1: DSPGA372 Physical Dimensions

[Figure: DSPGA372 photographs; panels: Top (die cavity), Bottom (heat sink)]

Pictures of DSPGA372 shown actual size

Figure 7.3: DSPGA372 Photos

composing the buttons can easily be pulled out of the cylindrical hol
the buttons, the wires often attach to the ridges on the person's finger
the fingers move away from the connector. As a result, the connector w
improperly. With proper equipment, the buttons can be restuffed, se
inserted into a system and compressed, the buttons remain situated

Initial experiments with [...] elastomer [...] elastomeric
technology is a viable alternative for this application. Elastomeric co
human handling. The elastomer provides a sheet of anisotropic cond
is required to match the pad geometry of the package or PCB. Size and s
customization required for each application. As a result NRE costs are

Picture of BB372 shown actual size

Figure 7.4: BB372

[Table: columns Feature, Size; surviving row label: Cooling Hole]

Table 7.2: BB372 Physical Dimensions

us slightly more freedom to choose the connector height. As a side b
to gasket the coolant forced through the cooling holes. On the negati
lower current capacity and higher resistance than the button-board



7.3.3 Printed Circuit Boards 

Printed-circuit boards are sandwiched between component layers
These PCBs are fairly [...] special
accommodations required are the land grids and alignment and co
with the DSPGA packages. The PCB lands mimic the geometry of the DSPGA
nobility contact, the [...] gold plated. Alignment holes allow con
as the BB372, to align with the PCB land pattern. Cool[...] requir
flow through the coolant holes provided in the connector and IC pack

Closeup of button on BB372

Figure 7.5: Button

Inside the stack, each printed-circuit boa[...] two, com
one above the PCB and one below it. Most of the pins on the compon
PCB should not be connected to the corresponding [...] opposite
side of the PCB. The PCB must not provide continuity [...] pads
which occupy the same planar location. This requirement can be sa
manufacturing technology by either using vias which only connect in
routing layers on the PCB, [...] they do not interse
In some cases, continuity between corresponding pins on the compon
is desirable. Power signals, busses, and global signal lines are comm

7.3.4 Assembly 

A stack is assembled by building [...] packaged ICs. Figure
depicts the composition of a typical stack. Figure 7.6 shows a more c
component stack. Figure 7.7 shows a close-up cross-section of an ass
placed [...] the stack provide vertical compressive force and pre
board alignment. A rigid metal plate at the top and the bottom of th
compressive force across the stack. The alignment pins and holes pr
carriers, and packaged components. The BB372 [...] provided i
board to align to [...] independently. [...] align
at every stage. Alignment tolerances from layer to [...] not additi
30 mils thick and the mating DSPGA372 is [...] thick, the space between P
in an assembled stack using these components is 140 mils.

7.3.5 Cooling 

Once stacked, the coolant holes in the DSPGA packages, carriers, and
vertical cooling channels through the stack structure. The heat sin
component [...] the four cooling channel[...]
allow forced air or liquid coolant to be circulated through the stack a
proper heat sink design, the coolant flowing across a component be
experience turbulent fluid flow to effect efficient heat transfer from th
All coolant channels can be left open allowing parallel flow across
column. Alternately, every other coolant hole can be plugged forcing
sinks in a coolant column.

7.3.6 Repair 

Component replacement is simplified [...] replace a faulty component,
PCB, or connector, we simply need to disassemble [...] sub
replacement for the faulty unit, and re-assemble the stack. [...]
need to desolder components [...] must be dis
from the stack and coolant drained before the stack can be disassem

[Figure: cross-section of routing stack; labels: aluminum plate, manifold, horizontal board, horizontal clock driver, spacer, window frame, bus bar (1v, 5v, gnd), debug connector]

Shown above is an enlarged cross-section of a network componen
(Diagram courtesy of Fred Drenckhahn)

Figure 7.6: Cross-section of Routing Stack



7.3.7 Clocking 



Any synchronous system requires that clocks be distributed to all c
the components see the clock edges at approximately the same tim
arrival times is known as clock skew and, generally, acts to limit the clock rate by
setup and hold times. The clock distribution problem in the stack i
problem of clock distribution on any large PCB or multi-PCB system.
distribution trees and low-skew clock buffers can help minimize thi

Shown here is a close-up picture of a mated [...] and DSPGA372 co
which has been cut at an oblique angle to expose the topology of the m
The stack shown above is composed of a BB372, DSPGA372, BB372, and DSPGA37
sandwiched inside a plastic encapsulant. The encapsulant serves to
together after the cross-sectional cut was made.

Figure 7.7: Close-up Cross-section of Mated BB372 and DSPGA372 Com

For short stacks in which the propagation delay vertically through
ponents is small, it may be sufficient to [...] clock
of PCB (See Figure 7.8) [...] direct the clock signals vertically through each c
stacks, the propagation delay through the column may be intolerable
through the vertical interconnects may not be sufficiently controlled
tion. Alternately, we can use a two-tier fanout scheme. Each PCB layer s
identical clock fanout, similar to the single layer fanout. The input to
from another fanout tree through carefully tuned lengths of controlle
flexible printed-circuit cables. The additional stage of fanout adds som

Another option is to provide a direct connection to each clocked I
the edge arrival time is carefully [...] kind of clock distribut
similar to the matched delay drivers described in Section 6.9. Howeve
clock skew make it a much more difficult problem.
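
The effect of unbalanced clock runs on skew can be estimated with a short calculation. This is a rough sketch, assuming skew comes only from trace-length mismatch and using the figure of about 15 cm of wire per nanosecond quoted in Section 6.9.4; the function name and the example lengths are hypothetical, and buffer and loading effects are ignored.

```python
def trace_skew_ps(lengths_cm, cm_per_ns=15.0):
    """Skew in picoseconds between the shortest and longest clock run.

    Assumes a uniform propagation velocity (cm_per_ns) on every trace and
    ignores buffer delay variation and loading differences.
    """
    if not lengths_cm:
        raise ValueError("need at least one trace length")
    delays_ps = [1000.0 * length / cm_per_ns for length in lengths_cm]
    return max(delays_ps) - min(delays_ps)
```

Under these assumptions a 3 cm mismatch between otherwise equal runs contributes about 200 ps of skew, which is why the fanout traces are drawn with balanced lengths.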

7.3.8 Stack Packaging of Non-DSPGA Components 

As described so far, the packaging scheme requires all ICs be packa
The networks described in this document are built out of homogen







[Figure: clock fanout tree; labels: Primary Clock, Buffer, Clocked IC]

Shown above is a representative clock fanout scheme. The trace lengths in all clock runs
should be balanced so as to guarantee as little skew between clock edges as possible.

Figure 7.8: Sample Clock Fanout on Horizontal PCB

long as we can package our routing component in DSPGA packages, the entire network can be
easily constructed as described. It is, nonetheless, worthwhile to consider how to accommodate
other components in the stack. The network endpoints, for example, constitute components other
than routers, and we may not have the freedom to package all such components in DSPGA packages.

The stack structure will readily accommodate low-profile components in the spaces between
DSPGA components. As noted, using DSPGA372 and BB372 components there is 140 mils of
clearance between PCB layers. ICs which fit comfortably within this height can be accommodated
in the stack. The height requirement precludes almost all through-hole components including
PGAs. Through-hole components further complicate the matter since their pins generally extend
through the PCB and into the space below the attached PCB. Most leadless chip carriers (LCC) and
gull-wing surface-mount components are around 100 mils thick and can easily be accommodated.
J-lead surface-mount components are generally thicker and leave insufficient clearance. Smaller
surface mount components such as TSOPs are easily accommodated and may be thin enough to
allow components to be mounted on both sides of adjacent boards. Of course, the non-DSPGA
components only have direct access to signals on the PCB to which they are mounted. Such
components must make use of the spare, through connections provided by DSPGA packages when
they require vertical interconnect. The non-DSPGA packages are not part of the assembled stack
cooling channels. Cooling for these components is limited to horizontal forced-air between PCB
layers.

7.4 Network Packaging Example 

For the sake of concreteness, let us consider how we package a small multistage, multipath
network. Figure 7.9 depicts a mapping of a multistage network into a stack package. Each stage of
routers is assigned its own plane in the stack. The routers are distributed evenly in both dimensions
within the plane. The PCB between planes of routing components implements the interconnect
between adjacent stages of routing components. Since there are $\Theta(N)$ routers in each stage,
distributed in two dimensions, each side is $\Theta(\sqrt{N})$ long, making the wire lengths between stages
$\Theta(\sqrt{N})$ long. The transit latency growth for this structure will thus match our expectations from
Section 3.1.5. If the inputs and outputs are not all segregated to opposite sides of the network as
shown in the figure, it will be necessary to run the input and output connections which originate on
the wrong side of the packaged network vertically through the network layers to connect the inputs
or outputs into the network. These loop-through connections are one class of signals which use the
straight-through interconnect provided by the DSPGA packages.

7.5 Packaging Large Systems 

7.5.1 Single Stack Limitations 

Unfortunately, there is a limit to the size of our stacks and hence the size of network which we
can build in a single stack package. Recall from Section 7.2.2, our PCB size is limited somewhere
under 30 inches by the fine-line technology we use. Vertical layers are relatively thin. Consequently,
if we package the layers as suggested in the previous section, we normally do not run into any
physical constraints in the vertical packaging dimensions. For example, a typical PCB thickness
for the horizontal PCB would be 100 mils. Section 7.3.4 noted that using DSPGA372 and BB372
components, the space between PCBs is 140 mils. In this scenario, each additional network layer
will increase the stack height by just under 0.25 inches. Since the PCB side size is increasing as
$\Theta(\sqrt{N})$ and the number of stages, and hence height, is increasing as $\Theta(\log(N))$, we encounter the
PCB size limitations before any vertical constraints.
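
The height arithmetic above can be checked with a short sketch. Only the 100 mil board thickness and the 140 mil inter-board spacing come from the text; the function names, the omission of end plates, and the use of a base-2 logarithm for the stage count are assumptions made for illustration.

```python
import math

PCB_THICKNESS_MILS = 100   # typical horizontal PCB thickness (from the text)
PCB_SPACING_MILS = 140     # space between PCBs with DSPGA372/BB372 parts


def layer_height_inches():
    """Height added by one network layer: one PCB plus one component gap."""
    return (PCB_THICKNESS_MILS + PCB_SPACING_MILS) / 1000.0


def stack_height_inches(num_layers):
    """Approximate stack height, ignoring end plates (an assumption)."""
    return num_layers * layer_height_inches()


def relative_growth(n_small, n_large):
    """Compare Theta(sqrt(N)) side growth with Theta(log N) height growth.

    Returns (side_ratio, height_ratio) when the machine grows from
    n_small to n_large endpoints. Constant factors are dropped, so only
    the ratios are meaningful.
    """
    side_ratio = math.sqrt(n_large / n_small)
    height_ratio = math.log2(n_large) / math.log2(n_small)
    return side_ratio, height_ratio
```

In this simplified model, growing from 64 to 4096 endpoints multiplies the board side by 8 but the number of layers by only 2, which is why the PCB size limit is reached before any vertical limit.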

Nonetheless, the vertical constraints that may arise are mostly dominated by cooling and signal
integrity constraints. As the number of components in a vertical column increases, in a parallel
cooling scheme we will require greater pressure and flow-rates to cool the components. Similarly,
in a serial cooling scheme, the temperature gradient between inlet and outlet will increase. As
noted in Section 7.3.7, for sufficiently tall stacks we cannot rely on vertical column interconnect
for high-speed, low-skew, global signal distribution.

[Figure 7.9, two panels. Left: "Logical Diagram of Simplified Network", with network inputs from
the processing nodes at the top and network outputs to the processing nodes at the bottom. Right:
"Physical Network Construction", a routing stack of printed circuit boards alternating with layers
of routing components. Networks not drawn to scale. Note: routing components are distributed
evenly in both dimensions across each plane.]

The diagram depicts how a logical network is mapped into the stack structure. The
interconnect between each pair of routing stages is implemented as a PCB in the stack. Each
stage of routing components becomes a layer of routing components packaged in DSPGA
packages.

Figure 7.9: Mapping of Network Logical Structure onto Physical Stack Packaging

7.5.2 Large-Scale Packaging Goals 

To build large networks, we seek two things:

1. A network stack primitive which represents a logical portion of the network and can be
replicated to realize the connectivity associated with the target network

2. A topology for packaging and interconnecting these primitives

As developed in Chapter 3, for large machines we focus on fat-tree networks. Our problem is
finding a decomposition of the fat tree into representative sub-networks which can be implemented
in a single stack structure. We desire homogeneity in stack primitives for the same reasons we do
in integrated-circuit components (See Section 2.7.2). When selecting a packaging infrastructure
for assembling the primitive stacks, we must address the same general packaging issues raised in
Section 7.1.

7.5.3 Fat-tree Building Blocks

Recall from Section 3.5.6 that we can think of a fat-tree, multistage network as composed of
three parts:

1. a down network which recursively sorts connections as they head from the root toward the
leaves

2. an up network to route connections upward toward the root

3. lateral crossovers which allow a connection to change from the up network to the down
network when it has reached the least common ancestor of the source and destination nodes

Using radix-$r$, dilation-$d$ crossbar routers, we build the down tree sorting network much like a
flat, multistage network. Routers in the upward path allow a connection to connect into one of the
next $(r - 1)$ downward routing stages or continue routing upward. The upward routers compose
both the up network and the crossover connections. For every $(r - 1)$ logical tree levels, we have
one upward routing stage.

We can collect the $(r - 1)$ downward routing stages, the associated upward routing stage, and
the crossover connections into a physical tree level. Each such physical tree level encompasses
$(r - 1)$ levels of the original tree. Taking all $(r - 1)$ down tree stages, the total sorting performed
by a physical tree level, $r_p$, is given in Equation 7.1.



$$r_p = r^{(r-1)} \qquad (7.1)$$



The size of the logical node at each physical tree level will increase as we head towards the root
since the bandwidth at each tree stage increases toward the root. As a result, we need to further
decompose each physical tree level into primitive units which can be assembled to service the
varying bandwidth requirements at each tree stage.

We use the term unit tree to refer to any primitive stack structure which implements a
bandwidth slice of each physical tree level. There is a large class of unit trees based on the parameters
of the router and packaging technology. Table 7.3 summarizes the parameters associated with a
unit tree stack. The router parameters $r$, $d$, and $w$ have been discussed in detail in Chapters 3 and 4.
At the "bottom" of the unit tree, the number of channels headed to and from physical tree levels
closer to the leaves is denoted $c_l$. $c_l$ is a multiple of the router dilation, $d$, which determines the
size of the bandwidth slice handled by the unit tree. Once $r$ is determined, $c_l$ can be chosen such
that the size of the unit tree is accommodated in a single packaging stack. One generally wants $c_l$
large for increased fault tolerance and resource sharing. The available packaging technology will
limit the size of any stacks and, hence, limit $c_l$. In these respects $c_l$ is much like the router dilation,
$d$; we generally select as large a value as we can afford given our physical and packaging limits.
At the leaves of a tree, we need to connect the processors into the tree, and hence we need a unit
tree with channel capacities matched to the input and output channel capacity of each node
($c_l = ni = no$). The channel capacity in and out of the top of a unit tree, $c_r$, is fully determined
once $c_l$ and $r$ are chosen and is given by Equation 7.2.



$$c_r = c_l \cdot r^{(r-2)} \qquad (7.2)$$



Parameter   Description
$r$         router radix
$d$         router dilation
$w$         router width
$c_l$       channels per logical direction toward leaves
            (depends on packaging technology and system requirements)
$r_p$       logical tree levels per physical tree level
            (determined by router radix)
$c_r$       channels out of unit tree toward root
            (determined by $c_l$ and $r$)

Table 7.3: Unit Tree Parameters
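
The relations among the parameters in Table 7.3 can be sketched in a few lines. The sketch assumes $r_p = r^{(r-1)}$ (Equation 7.1) and $c_r = c_l \cdot r^{(r-2)}$ (Equation 7.2); the function names are illustrative and do not come from the original design.

```python
def physical_level_radix(r):
    """Equation 7.1: sorting performed by one physical tree level, r_p = r^(r-1)."""
    return r ** (r - 1)


def root_channels(c_l, r):
    """Equation 7.2: channels out of the top of a unit tree, c_r = c_l * r^(r-2)."""
    return c_l * r ** (r - 2)
```

For a radix-4 router this gives $r_p = 64$, and $c_l = 2$ or $c_l = 8$ gives $c_r = 32$ or $c_r = 128$ respectively.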





                              Routing Components
Stage                         $UT_{64 \times 2}$    $UT_{64 \times 8}$
Up Routing Stage              16                    64
Final Down Routing Stage      16                    64
Middle Down Routing Stage     12                    48
Initial Down Routing Stage    8                     32

Table 7.4: Unit Tree Component Summary



7.5.4 Unit Tree Examples 

For the sake of illustration, let us consider two specific unit tree configurations introduced in
[DeH91] and [DeH90]. Here, we denote each unit tree $UT_{r_p \times c_l}$. Both of these unit trees use
the RN1 routing component, a radix-4, dilation-2 routing component (See Chapter 8). $UT_{64 \times 2}$ has
$c_l = 2$ and, consequently, $c_r = 32$. $UT_{64 \times 8}$ has $c_l = 8$ and $c_r = 128$. Both have $r_p = 64$. Table 7.4
summarizes the number of components composing each stage of the network in each unit tree. If
each routing component is housed in a DSPGA372 package measuring 1.4 inches along each side,
and we leave as much space between routing components, the $UT_{64 \times 2}$ stack measures just under
1 foot along each side, and the $UT_{64 \times 8}$ measures just under 2 feet. Assuming BB372 connectors,
the four layers of routing components in both unit trees are 1 inch tall. With compressional plates,
each stack is under 2 inches tall.
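
The stack footprints quoted above can be reproduced with a small estimate. The 1.4 inch package side and the equal-sized gap come from the text; the square-grid placement and the function name are assumptions made for this sketch.

```python
import math

PACKAGE_SIDE_IN = 1.4   # DSPGA372 package side (from the text)
GAP_IN = 1.4            # "as much space" left between routing components


def stack_side_inches(components_in_largest_stage):
    """Side of a square stack that holds the largest routing stage.

    Assumes (hypothetically) that components sit on a square grid with
    one package-width gap per component pitch.
    """
    per_side = math.isqrt(components_in_largest_stage)
    if per_side * per_side != components_in_largest_stage:
        raise ValueError("sketch only handles perfect-square stage sizes")
    return per_side * (PACKAGE_SIDE_IN + GAP_IN)
```

With the largest stages from Table 7.4 (16 and 64 components), the estimate gives about 11.2 and 22.4 inches, consistent with "just under 1 foot" and "just under 2 feet".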

At the leaves, a single unit tree with $c_l = ni = no$ is connected to each cluster of $r_p$ processors.
To build the next size larger machine, we replicate the lower physical subtree $r_p$ times and build a
new tree level out of enough unit trees to support the channels entering or leaving the roots of all
of the lower physical subtrees. To quantify, if $c_n$ channels come out of each subtree at level $n$,
the total number of unit trees required to form the root of the physical subtree at physical tree level
$n + 1$ is given by:

$$N_{n+1} = \frac{c_n}{c_l} \qquad (7.3)$$

In turn, the physical subtree rooted at physical tree level $n + 1$ will have a total channel capacity
out of its root given by:

$$c_{n+1} = c_r \cdot N_{n+1} \qquad (7.4)$$

With the particular unit trees just introduced, we form the leaves of the tree by connecting one
$UT_{64 \times 2}$ to each cluster of 64 processors. If we use $UT_{64 \times 2}$ unit trees in the second level as well
as the first, we need $\frac{32}{2} = 16$ $UT_{64 \times 2}$ unit trees to form the root of each physical subtree rooted
in the second physical tree level. The subtree rooted in the second physical tree level supports
$64^2 = 4096$ nodes and includes 64 $UT_{64 \times 2}$ unit trees in the first physical tree level. Alternately,
we can use $\frac{32}{8} = 4$ $UT_{64 \times 8}$ unit trees to compose the root of each subtree rooted in the second
physical tree level to support the same number of nodes and first level unit trees. To build an even
larger machine, we can make 64 copies of a 4096-node tree and interconnect them using a third
physical tree level composed of $\frac{512}{2} = 256$ $UT_{64 \times 2}$ unit trees or $\frac{512}{8} = 64$ $UT_{64 \times 8}$ unit trees.
The resulting subtree rooted in the third physical tree level supports $64^3 = 262{,}144$ nodes and has
a total of $64 \cdot 16 = 1024$ $UT_{64 \times 2}$ or $64 \cdot 4 = 256$ $UT_{64 \times 8}$ unit trees composing the internal tree
nodes in the second physical tree level and 4096 $UT_{64 \times 2}$ unit trees composing the nodes in the first
physical tree level.
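
The unit tree counts in this example follow from applying Equations 7.3 and 7.4 one level at a time. The sketch below takes Equation 7.3 in the form $N_{n+1} = c_n / c_l$, which is inferred from the worked numbers in this section; the function and constant names are illustrative only.

```python
def build_levels(c_r_leaf, level_unit_trees):
    """Apply Equations 7.3 and 7.4 level by level.

    c_r_leaf: channels out of the root of one first-level subtree.
    level_unit_trees: list of (c_l, c_r) pairs describing the unit trees
        used at physical tree levels 2, 3, and so on.
    Returns the number of unit trees forming the subtree root at each of
    those levels.
    """
    counts = []
    c_n = c_r_leaf
    for c_l, c_r in level_unit_trees:
        if c_n % c_l:
            raise ValueError("channel count is not a multiple of c_l")
        n_next = c_n // c_l      # Equation 7.3: N_{n+1} = c_n / c_l
        c_n = c_r * n_next       # Equation 7.4: c_{n+1} = c_r * N_{n+1}
        counts.append(n_next)
    return counts


UT64x2 = (2, 32)     # (c_l, c_r) for the smaller unit tree
UT64x8 = (8, 128)    # (c_l, c_r) for the larger unit tree
```

Starting from a first level of the smaller unit trees (32 root channels each), `build_levels(32, [UT64x2, UT64x2])` yields 16 and then 256 root unit trees, and `build_levels(32, [UT64x8, UT64x8])` yields 4 and then 64, matching the counts above.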

7.5.5 Hollow Cube 

To physically organize the unit trees which make up a large fat tree in an efficient manner, we
must consider the interconnect topology. Each group of $r_p$ subtrees at physical tree level $l$ will
be connected to the subset of unit trees at physical tree level $l + 1$ which compose the root of the
subtree. If each of the subtrees at tree level $l$ is composed of $U$ unit trees and the parent tree level
is constructed from the same size unit trees, we know the parent set of unit trees will be composed
of

$$U_{parent} = \frac{c_r}{c_l} \cdot U$$

unit trees (See Equations 7.3 and 7.4). This gives us

$$U_{children} = r_p \cdot U = r^{(r-1)} \cdot U \qquad (7.5)$$

unit trees at tree level $l$ connecting to

$$U_{parent} = \frac{c_r}{c_l} \cdot U = r^{(r-2)} \cdot U \qquad (7.6)$$

unit trees at tree level $l + 1$. From these relations we see that the group of unit trees composing
the root of a physical subtree will generally be connected to $r$ times as many similarly sized unit
trees in the immediately lower physical tree level.
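
As a worked step, dividing Equation 7.5 by Equation 7.6 makes the convergence ratio explicit. This only combines the two relations above, written with the symbols $U_{children}$ and $U_{parent}$ used there:

```latex
\frac{U_{children}}{U_{parent}}
  = \frac{r^{(r-1)} \cdot U}{r^{(r-2)} \cdot U}
  = r
```

For $r = 4$ the ratio is 4, so each group of root unit trees connects to four times as many unit trees in the level below.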

When $r = 4$, as in our examples from the previous section, a natural approach to accommodating
this 4:1 convergence ratio in our three-dimensional world is to build hollow cubes. We select one
cube face as the "top" of the cube and tile the unit trees composing the root of a physical subtree
in the plane across the top face of the cube. Together the four adjacent faces in the cube have four
times the surface area of the top and hence can be tiled with four times as many unit tree stacks.
The sides can house all of the unit trees composing the roots of the $r_p = 64$ immediate children
of the cube top. If the top is part of physical tree level $l$, the sides contain unit trees which are part
of physical tree level $l - 1$. We leave the "bottom" of the cube open to increase accessibility to the
cube's interior. Figures 7.10 and 7.11 show two hollow-cube arrangements for machines composed
of two physical tree levels. Figure 7.12 depicts the hollow-cube arrangement for a machine with
three physical tree levels.

Shown here is an example where the unit trees in both physical tree levels shown are the
same size. This is comparable to the case described in Section 7.5.4 where a 4096 processor
machine was built from two physical tree levels composed entirely of $UT_{64 \times 2}$ unit trees.
Based on the stack sizes assumed in Section 7.5.4, this hollow cube would measure about
4 feet on each side.

Figure 7.10: Two Level Hollow-Cube Geometry

The hollow-cube topology is optimized to expose the surfaces of stacks which interconnect
to each other. All of the interconnect between unit trees within a hollow-cube fat tree will occur
between the sides and top face of some cube. The hollow-cube topology is a three-dimensional
fractal-like geometry which attempts to maximize the surface area exposed for interconnect within
a given volume.
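
The cube dimensions quoted in the captions for Figures 7.10 through 7.12 follow from tiling the top face with unit tree stacks. The sketch below assumes a gap-free square tiling and uses the rounded stack sizes of about 1 foot and 2 feet from Section 7.5.4; the function name is illustrative.

```python
import math


def cube_side_feet(top_unit_trees, stack_side_feet):
    """Side of a hollow cube whose top face is tiled with unit tree stacks.

    Assumes (for illustration) a square tiling with no gaps between stacks.
    """
    per_side = math.isqrt(top_unit_trees)
    if per_side * per_side != top_unit_trees:
        raise ValueError("sketch only handles square tilings")
    return per_side * stack_side_feet
```

Sixteen 1-foot stacks or four 2-foot stacks on the top face both give a cube about 4 feet on a side, and 256 1-foot stacks give a central cube about 16 feet on a side.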

7.5.6 Wiring Hollow Cubes 

Each of the unit tree stacks in the sides of a cube feeds connections to and from unit trees
composing the parent subtree in the top of the cube. All of the connections in and out of the top of a
unit tree stack are logically equivalent. These logically equivalent channels should be distributed
among the unit trees composing the parent subtree for fault tolerance. This fanout from a unit tree
to multiple unit trees in the parent subtree is desirable for the same reasons fanout from the dilated
connections of a single router is desirable (See Section 3.5.3). With proper fanout, entire unit tree
stacks can be removed from the non-leaf, physical tree levels, and the network still retains sufficient
connectivity to route all connections.

Shown here is an example where the unit trees in the higher physical tree level are four
times as large as the ones in the lower physical tree level. This is comparable to the case
described in Section 7.5.4 where the lowest tree level was composed of $UT_{64 \times 2}$ unit trees
while the higher level was composed of $UT_{64 \times 8}$ unit trees. Like Figure 7.10, this hollow
cube measures about 4 feet on each side.

Figure 7.11: Two Level Hollow Cube with Top and Side Stacks of Different Sizes

Wire connections are made through the center of each hollow cube using controlled-impedance
cables. The worst-case wire length between two physical tree levels is proportional to the length
of the side of the cube which the wire traverses. Any route through the network traverses a cube
of a given size at most twice, once on the path to the root and once on the path from the root to the
destination node.

7.5.7 Hollow Cube Support 

To support the unit trees making up a hollow cube, we build a gridded support substrate much
like the raised floors used in traditional computer rooms. Due to the physical size of the hollow
cubes, they occupy room-sized or building-sized structures. The structure to house the hollow-cube
network is built with these gridded walls and ceilings to accept the unit trees which are used as
building blocks. Conduits for power and coolant are accommodated along grid lines in the gridded
substrate. The sixth face of each hollow cube, vacant of unit trees, supplies access to the interior
and provides a location for cooling pumps and power supplies.

Shown above is a hollow cube containing three physical tree levels. If all the unit trees
were $UT_{64 \times 2}$ unit trees, this structure would house 262,144 endpoints. Making the same
assumptions as in Figure 7.10, the central cube in this structure measures about 16 feet
along each side. The whole unit, as shown, would measure 24 feet along each side and be
16 feet tall.

Figure 7.12: Three Level Hollow Cube

Rather than directly connecting the wires into a unit tree to the unit tree itself, we can build a
wiring harness which mates with the unit tree. This harness collects all the wires connecting to a
single unit tree. The harness makes compressional contact with either the top or bottom of a unit
tree stack to connect the wires to the unit tree. This wiring harness simplifies the task of replacing
a unit tree stack. Without the harness it would be necessary to unplug all of the connections into
the outgoing unit tree and then reconnect them to the replacement. Since each unit tree generally
supports hundreds of connections, this operation would be involved and highly error prone.

7.5.8 Hollow Cube Limitations 

As introduced, hollow cubes are only well matched to radix four fat-trees and only retain many
of their nice properties up to three physical tree levels. For the first three tree levels, the cube
side lengths increase by a factor of four between tree levels. Since each successive tree level
accommodates 64 times as many processors, side length, and hence worst-case wire length, grows
as $\sqrt[3]{N}$. Starting at the fourth physical tree level, the need to accommodate space occupied by
lower tree levels increases the growth factor to six. As a result, worst-case wiring lengths grow
faster beyond this point. Also starting at the fourth physical tree level, the "bottoms" of many of
the hollow cubes become blocked by other hollow cubes, limiting maintenance access.
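
The change in wire-length growth can be made concrete by converting the per-level side factor into an exponent on the machine size N. This is a sketch under the stated growth factors (4 for the first three levels, 6 afterwards, 64 times as many processors per level); the function name is illustrative.

```python
import math


def side_growth_exponent(side_factor, processors_factor=64):
    """Exponent e such that cube side length grows as N**e.

    If each new physical tree level multiplies the side length by
    side_factor and the processor count by processors_factor, then the
    side grows as N ** (log(side_factor) / log(processors_factor)).
    """
    return math.log(side_factor) / math.log(processors_factor)
```

A side factor of four gives an exponent of one third, which is the cube-root growth of the first three levels; a side factor of six gives an exponent of roughly 0.43, so wire lengths grow noticeably faster from the fourth level on.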

7.6 Multi-Chip Modules Prospects 

The Multi-Chip Module (MCM) is an emerging packaging technology that further improves
component packaging density by dispensing with the IC package. Bare die are bonded directly
to a high-performance substrate which serves to interconnect the die. The removal of the IC
package allows components to be situated more closely, lowering interconnect latency. Recall from
Section 7.2.1 that package size is proportional to the spacing of external i/o pins, not the die size.
Avoiding the package allows the component to only take up space relative to the die size.

Unfortunately, MCM technology has a number of drawbacks which relegate its use to small,
high-end systems today. Few IC manufacturers are in the practice of supplying bare, tested die.
Final, full-speed IC testing is generally done after the die is packaged. The facilities available for
full-scale testing of unpackaged die are limited. As a result, it is generally not possible to know
whether all of the die will work before assembling an MCM. Since component speed grading is
also generally only performed on packaged ICs, one has little knowledge of the yielded operational
speed for each IC. These drawbacks are compounded by the fact that repair and rework technology
for MCMs is in its infancy. The MCM technologies available today generally are not amenable
to die replacement. Consequently, stocked MCM yield is low. Additionally, NRE costs on MCM
substrates are comparable to silicon IC NRE costs rather than PCB NRE costs. The combination
of the fact that MCMs have yet to become a high-volume technology, the lower yield due to lack
of repairability, and high NRE costs makes MCM technology uneconomical for most designs at
present.

When MCMs become an economically viable technology, they may be able to replace packaged
ICs and PCBs. Stacks composed from layers of MCMs could be a factor of 3 to 4 smaller in each
of the planar dimensions than the stacks described with DSPGA372 style components. To build
MCM stacks, we would need the ability to connect signals into both sides of an MCM substrate.
If the MCM is limited to single-sided or peripheral i/o, the size of the MCM required to satisfy i/o
requirements alone may negate much of the density benefits. Unless significant advances are made
in MCM repair, MCM stacks would not have the repair advantages of the current stack packaging
scheme. The hollow-cube topology can be used to interconnect MCM stacks with most of the same
benefits and limitations.

7.7 Summary 

In this chapter we developed a three-dimensional stack packaging technology. Using dual-sided
pad-grid array IC packages, compressional board-to-package connectors, and conventional
PCBs, we developed a stack structure with alternating layers of components and PCBs. The
combination of DSPGA packages and compressional connectors served to both connect packaged
ICs to horizontal PCBs and to provide vertical board-to-board interconnect within the stack. We
demonstrated how to map multistage networks into this three-dimensional stack structure. We
then looked at the limitations on the size of stacks we can build. To accommodate larger systems,
we developed a network decomposition for fat-tree, multistage networks which allows us to build
large fat-tree networks from one or two primitive unit tree stack designs. We also showed how these
unit tree stacks can be arranged in a hollow-cube geometry to construct large fat-tree machines and
commented on the limitations of the hollow-cube structure.

7.8 Areas To Explore 

Many areas of packaging are quite fertile for exploration.

• We have pointed out the limitations with current MCM technology and suggested some
requirements necessary for the technology to provide real benefits.

• The hollow-cube topology has many nice properties up to its limitations. It would be useful
to find alternative topologies with a wider range of application.

• If free-space optical interconnect becomes a viable technology on this scale, the hollow cubes
can become truly hollow. Using free-space optical transmission across the long distances
through the cube, we could exploit the propagation rate of light to keep transit latencies
low. The distance across the larger hollow-cube stages is sufficiently large that the savings
due to higher propagation rate may make up for the latency associated with converting
electrical signals to light and back again. Recent work in optics promises to integrate
electrical and optical processing so that we will be able to build optical conversions into our
primitive routing elements [Mil91]. Since photons do not interfere with each other, free-space
optics makes the task of wiring the interconnections through the center of the hollow
cube trivial. [B Jfr86] and [WBJ+87] discuss early work on large-scale, free-space optical
interconnect for VLSI systems. They use holographic optical elements to direct optical
beams for interconnections. The flexibility of the holographic media holds out promise for
adaptive and dynamic connection alignment and reconfiguration. At present, much work is
still needed on efficient conversion between electrical and optical signals and emitter-detector
alignment.

Part III

Case Studies

8. RN1



RN1 is a circuit-switched, crossbar routing element developed in the MIT Transit Project [MDK91].
RN1 may be configured either as an 8-bit wide, radix-4, dilation-2 crossbar router (i.e. $i = 8$,
$r = 4$, $d = 2$, $w = 8$) or as a pair of independent, radix-4, dilation-1 routers (i.e. two routers
with $i = 4$, $r = 4$, $d = 1$, $w = 8$) (See Figure 8.1). In both configurations, RN1 supports the
basic routing protocol detailed in Section 4.5. RN1 has no internal pipelining. Each RN1 router
establishes connections and passes data with a single clock cycle of latency.

Figure 8.2 shows the micro-architecture for RN1. Each forward and backward port contains
a simple finite-state machine for maintaining connection state and processing protocol signalling.
The line control units keep track of available backward ports and handle random port selection.
Backward port arbitration occurs in a distributed fashion along each logical output column. When
several forward ports attempt to open a connection to the same logical backward port during the
same cycle, an 8-way arbitration for the available backward ports takes place. [Min91] contains a
detailed description of the design and implementation of RN1.

RN1 was implemented as a full-custom CMOS integrated circuit using a combination of
standard-cell and full-custom layout. Standard, five-volt, CMOS i/o pads were used with this
first-generation routing component. RN1 was fabricated in Hewlett Packard's CMOS process
(CMOS34) through the MOSIS service. The RN1 die measures 1.2 cm on each side. The die size
was fully determined by the perimeter required to house 160 signal pads plus power and ground
connections in a single row of peripheral bonding pads. RN1 is packaged in a DSPGA372 package
as shown in Figure 8.3.

Figure 8.1: RN1 Logical Configurations

Figure 8.2: RN1 Micro-architecture

RN1 can support clock rates up to 50 MHz. Analysis of the critical-path timing indicates
that the standard, five-volt i/o pads and the standard clock buffer are key contributors limiting
clock frequency. The input and output latencies for the RN1 i/o pads are each roughly 10 ns
(i.e. $t_{io} = 20$ ns). Inside the i/o pads, the latency through the IC logic is around 14 ns
($t_{switch} = 14$ ns).

RN1 is housed in the DSPGA372 package introduced in Section 7.3.1.

Figure 8.3: Packaged RN1 IC

9. METRO



The Multipath Enhanced Transit Routing Organization (METRO) is an architecture for second
generation Transit routing components. The METRO architecture encompasses the MRP-ROUTER
protocol described in Chapter 4 including the enhancements described in Section 4.9. In addition
to the basic router protocol implemented by RN1, METRO includes multi-TAP scan, port-by-port
deselection, partial-external scan, width cascading, fast path reclamation, and pipelining provisions.

9.1 METRO Architectural Options

The METRO architecture encompasses a large space of routing component configurations and
router behavior. Some architectural parameters must be fixed when constructing a particular routing
component. Any particular router will have a fixed data width ($w$), a fixed number of inputs and
outputs ($i$, $o$), a fixed number of pipeline delays routing data through the router ($dps$), a fixed
number of header words swallowed during connection establishment ($hw$), and a fixed number of
scan paths ($sp$). Each particular router will have configuration options, accessible via the TAP,
which allow one to choose among a set of possible router behaviors. One can configure a METRO
routing component to act as a radix-$r$, dilation-$d$ router ($o = r \times d$) by setting the effective dilation.
Each particular router will have a maximum limit on the dilation setting ($max\_d$). Each forward
and backward port on a METRO router can be enabled or disabled (See Section 4.9.1). It is
also possible to configure each forward and backward port on a METRO routing component to
accommodate the pipeline delay cycles associated with pipelining data on the wires between routers
(See Section 4.11.3). The configured value, $vtd$, defines the number of cycles which will transpire
between the time when a port is turned and the time when the first piece of return data arrives.
Each particular routing component will allow $vtd$ to take on any value up to some
component-specific maximum, $max\_vtd$. Each forward and backward port can also be configured to either use
fast path reclamation or detailed connection shutdown (See Section 4.9.2). Table 9.1 summarizes
the architectural variables which must be selected during the construction of a routing component.
Table 9.2 summarizes the configuration options which are available on each METRO routing component.
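
As a rough illustration of how the configuration options add up, the sketch below counts configuration bits using the per-option sizes listed in Table 9.2: $\lfloor \log_2(\log_2(max\_d)) \rfloor + 1$ bits for the dilation, one bit per port for each of the three port options, and $\log_2(max\_vtd)$ bits per port for the turn delay. The function names, the handling of $max\_d = 1$, and the restriction to power-of-two $max\_vtd$ are assumptions of this sketch, not part of the METRO specification.

```python
import math


def dilation_bits(max_d):
    """Bits for the dilation option: floor(log2(log2(max_d))) + 1.

    The formula is undefined for max_d = 1 because log2(1) = 0; this
    sketch assumes (hypothetically) that no dilation bits are needed then.
    """
    if max_d == 1:
        return 0
    return math.floor(math.log2(math.log2(max_d))) + 1


def config_bits(i, o, max_d, max_vtd):
    """Total configuration bits for the options listed in Table 9.2.

    Assumes max_vtd is a power of two and at least 1 so that
    log2(max_vtd) is a whole number of bits.
    """
    if max_vtd < 1 or max_vtd & (max_vtd - 1):
        raise ValueError("sketch assumes max_vtd is a power of two >= 1")
    ports = i + o
    one_bit_options = 3          # (de)select, deselected drive, fast reclamation
    vtd_bits = int(math.log2(max_vtd))
    return dilation_bits(max_d) + ports * (one_bit_options + vtd_bits)
```

For a hypothetical component with $i = o = 8$, $max\_d = 2$, and $max\_vtd = 4$, this counts 1 dilation bit plus 5 bits for each of the 16 ports.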

9.2 METRO Technology Projections

Based on our experience w it h RN1 and the signalling t ech nology describ ed in Ch apt er 6, w e 
b elieve w e can b uild a comparab ly sME^RO rout er in Hew let t Pack^Q.S^m effect ive gat e- 
lengt hCMOS process (CMOS26) w h ich operat es at clock frequencies up t o 200 MHz. As not ed in t h e 
previous sect ion, the crit ical pat h for connect ion est ab lish ment in RN1 w as under 14 ns. Th rough 
a comb inat ion of t ech nology scaling and clever circuit t ech niques, w e can reduce t h is allocat ion 
lat ency t o 5 t o 10 ns. Thflow -t h rough lat ency on RN1 w as much less t h an t h e allocat ion lat ency. 
Rout ing dat a b et w een a forw ard port and b ackw ard port of an open connect ion w it h less t h an 5 ns 
of lat ency sh ould b e quit e manageab le. Consequent ly, w e expect a part operat ing at 200 MHz can 
Variable   Function                                    Range
--------   -----------------------------------------   ----------------------------------
sp         Number of Scan Paths                        sp ≥ 1
w          Bit Width of Data Channel                   w ≥ max_d
max_d      Maximum Dilation                            max_d = 2^n for some n ≥ 0;
                                                       max_d ≤ o
i          Number of Forward Ports                     i = 2^n for some n ≥ 0
o          Number of Backward Ports                    o = 2^n for some n ≥ 0;
                                                       o ≥ max_d
ri         Number of Random Input Bits                 ri ≥ 1
hw         Number of Header Words Consumed             hw ≥ 0
           Per Router
dps        Number of Data Pipestages Through Router    dps ≥ 1
max_vtd    Maximum Number of Delay Slots               max_vtd ≥ 0
           Available for Variable Turn Delay

This table summarizes the architectural variables which distinguish a particular METRO
routing component.

Table 9.1: METRO Architectural Variables



Option                          One for Each   Number of Instances   Bits Each
-----------------------------   ------------   -------------------   ---------------------------
Dilation (d)                    component      1                     floor(log2(log2(max_d))) + 1
Port (De)select                 port           i + o                 1
Deselected Port Drives Output   port           i + o                 1
Fast Reclamation                port           i + o                 1
Turn Delay (vtd)                port           i + o                 log2(max_vtd)

Each METRO router has several configuration options which control its behavior. This table
summarizes the options common to all METRO routing components.

Table 9.2: METRO Router Configuration Options
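
The bit counts in Table 9.2 can be totalled for a given component. The following is a minimal sketch, not part of the METRO specification: the function names are hypothetical, the parameter values in the usage note are chosen only for illustration, and the sketch assumes max_d ≥ 2 and a power-of-two max_vtd so that both logarithms are well defined.

```python
import math


def dilation_bits(max_d):
    """Bits in the per-component dilation field: floor(log2(log2(max_d))) + 1."""
    if max_d < 2:
        # Assumption of this sketch: log2(log2(1)) is undefined, so require max_d >= 2.
        raise ValueError("this sketch assumes max_d >= 2")
    return math.floor(math.log2(math.log2(max_d))) + 1


def config_bits(i, o, max_d, max_vtd):
    """Total configuration bits for one router, following the rows of Table 9.2."""
    if max_vtd < 1 or max_vtd & (max_vtd - 1):
        # Assumption of this sketch: max_vtd is a power of two.
        raise ValueError("this sketch assumes max_vtd is a power of two")
    vtd_bits = max_vtd.bit_length() - 1      # exact log2 for a power of two
    per_port = 1 + 1 + 1 + vtd_bits          # (de)select, drive, fast reclamation, vtd
    return dilation_bits(max_d) + (i + o) * per_port
```

For a hypothetical component with i = o = 8, max_d = 2, and max_vtd = 4, `config_bits(8, 8, 2, 4)` counts one dilation bit plus five bits for each of the sixteen ports.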

be built with one cycle of data latency (dps = 1) and one or two cycles of connection establishment
latency (hw = 0 or hw = 1). Using the i/o pads detailed in Chapter 6, the delay through a pair of
i/o pads is 3 ns. As long as the propagation delay between two endpoints is under 2 ns, the i/o
pads and interconnect serve as a single pipeline stage. With conventional PCB (ε_r ≈ 4), wire runs
up to 30 cm in length can be traversed in a single 200 MHz clock cycle.
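
The 30 cm figure can be checked with a short calculation. This sketch assumes a relative permittivity of about 4 for conventional PCB material and a non-magnetic dielectric, so that signals propagate at c / sqrt(ε_r); the function name and parameters are illustrative only.

```python
import math

C_CM_PER_NS = 29.9792458  # speed of light in vacuum, in cm per ns


def max_wire_length_cm(clock_mhz, pad_delay_ns, eps_r):
    """Longest wire that fits in one clock cycle alongside the i/o pad delay."""
    period_ns = 1000.0 / clock_mhz               # 200 MHz -> 5 ns
    wire_budget_ns = period_ns - pad_delay_ns    # time left for the wire itself
    velocity = C_CM_PER_NS / math.sqrt(eps_r)    # propagation speed in the dielectric
    return wire_budget_ns * velocity
```

With a 200 MHz clock and 3 ns through the pad pair, 2 ns remain for the wire, and at roughly 15 cm/ns that corresponds to just under 30 cm.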
10. Modular Bootstrapping Transit Architecture (MBTA)



The Modular Bootstrapping Transit Architecture (MBTA) is a series of small multiprocessors based
around multistage routing networks composed of RN1 or METRO routing components. MBTA
integrates a number of minimal processing nodes with a multistage network organized as described
in Section 3.5.

10.1 Architecture 

Figure 10.1 shows the network used for a 64-processor MBTA machine. Each processing node
has two network inputs and two network outputs (ni = no = 2) for fault tolerance. The network
shown is composed of RN1-style routing components and uses the dilation-1 router configuration
in the final stage so that two different routing components may provide network outputs from
the network to each processing node. Since RN1 is a radix-4 routing component, the network is
comprised of log_4(64) = 3 routing stages.
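
The stage count follows directly from the radix. A minimal sketch, using integer arithmetic to avoid floating-point logarithms (the function name is illustrative):

```python
def routing_stages(endpoints, radix):
    """Smallest number of radix-r stages able to distinguish `endpoints` destinations."""
    stages = 0
    reach = 1
    while reach < endpoints:
        reach *= radix   # each stage multiplies the number of reachable destinations
        stages += 1
    return stages
```

For 64 endpoints and radix-4 routers this gives three stages, matching the network in Figure 10.1.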

Figure 10.2 shows the architecture of the MBTA processing nodes. Each node is composed
of a RISC microprocessor (e.g. Intel's 80960CA [MB88] [Int89]), fast, static memory, network
interfaces, and support logic. Four logical network interfaces service the two connections into and
the two connections out of the network. The processor performs computation, initiates network
communications, and services non-primitive network operations. The processor is also responsible
for the highest levels of MRP-ENDPOINT, which are not handled by the network interface. A single,
high-speed memory bank serves to hold instructions and data for the processors, store data coming
and going from the network, and store connection status information. The basic node architecture
also has provisions to support co-processors and alternate forms of memory. In order to interface
MBTA machines with existing computers and data networks, there are provisions for some nodes
to accommodate external interfaces.

10.2 Performance 

The MBTA architecture has been balanced to support byte-wide network connections running
at 100 MHz. The network interfaces send data from the fast, static memory and receive data
into the memory, as well. Consequently, each network interface requires 100 megabytes/second
(100 MB/s) of bandwidth into memory during sustained data transfers. The processor is running
at 25 MHz and may read up to one word, or four bytes, per cycle during burst memory operations.
To prevent the processor from stalling, it, too, needs 100 MB/s of bandwidth into memory. To run
all network interfaces and the processor simultaneously at full-speed, we would need 500 MB/s of
bandwidth into memory. To simplify the problem, we restrict operation so that only one network
input may be feeding data into the network at a time. This restriction limits the contention in the
network while giving us the fault tolerance benefits of having two connections into the network.
To provide the 400 MB/s of bandwidth required, we use 64-bit wide, 20 ns, synchronous SRAM
Shown here is the network for a 64-processor MBTA machine composed of RN1 routing
components.

Figure 10.1: MBTA Routing Network
Shown above is the architecture for each MBTA node. The units outside of the dotted box
are common to all MBTA nodes. Within an MBTA machine, a few nodes would support
external interfaces.

Figure 10.2: MBTA Node Architecture
for the high-speed memory on a pipelined bus. Each of the four units using the memory gets the
opportunity to read or write one, 8-byte value to or from memory every 80 ns. This allows each
unit to sustain 100 MB/s data transfers without internal buffering as long as data can be transferred
as contiguous double-words.
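
The memory bandwidth budget above can be reproduced with a few lines of arithmetic. This is a sketch only; the function and variable names are illustrative, and MB here means 10^6 bytes.

```python
def bandwidth_mb_per_s(bytes_per_access, access_period_ns):
    """Sustained bandwidth for one access of the given size every access_period_ns."""
    # bytes per ns equals GB/s; scale by 1000 to express the result in MB/s.
    return bytes_per_access * 1000.0 / access_period_ns


UNITS = 4              # three active network interfaces plus the processor
BUS_CYCLE_NS = 20      # 64-bit wide, 20 ns synchronous SRAM
WORD_BYTES = 8         # one 64-bit double-word per bus cycle

total = bandwidth_mb_per_s(WORD_BYTES, BUS_CYCLE_NS)             # whole memory bus
per_unit = bandwidth_mb_per_s(WORD_BYTES, UNITS * BUS_CYCLE_NS)  # one slot every 80 ns
```

The bus as a whole supplies 400 MB/s, and each of the four units, taking one slot in four, sees 100 MB/s.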

If all nodes are busy sending data at 100 MB/s, a 64-processor network, like the one shown
in Figure 10.1, can support a peak bandwidth of 6,400 MB/s = 6.4 GB/s. With one network input
and both network outputs in operation, a single node can simultaneously transfer up to 300 MB/s.
Running the METRO component described in Section 9.2 at 100 MHz, it takes one cycle to traverse
each router and one cycle to traverse each wire in the network. The unloaded latency through
the network is 70 ns, arising from 10 ns of latency through each of the three routing
components in any path through the network and 10 ns of latency through each of the four chip
crossings between network endpoints. If our technology projections for METRO hold, we could
implement a version with RN1-style pipelining and cut this latency in half. Alternately, if we could
cycle the pipelined memory bus twice as fast or increase the memory width to 128 bits and require
16-byte data transfers, we could support 200 MB/s network connections using the router at
full speed. This would cut the unloaded network latency in half to 35 ns. This change would also
double the bandwidth figures above and cut the transmission time, $T_{transmit}$, in half. For this size
of a network, the total time to communicate a message from one node to another will
be dominated by the network input and output latency and the transmission latency,
$T_{transmit}$.
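
The latency and bandwidth figures in this section follow from simple counting. The sketch below is illustrative only; the function names are hypothetical, and it assumes one clock cycle per router and one per chip crossing, as stated above.

```python
def unloaded_latency_ns(stages, cycle_ns, router_cycles=1, wire_cycles=1):
    """Unloaded network latency: every router plus every chip crossing on the path."""
    crossings = stages + 1   # endpoint-to-router, router-to-router, router-to-endpoint
    return (stages * router_cycles + crossings * wire_cycles) * cycle_ns


def peak_bandwidth_mb_per_s(nodes, per_node_mb_per_s):
    """Aggregate bandwidth when every node is sending at its sustained rate."""
    return nodes * per_node_mb_per_s
```

With three stages and a 10 ns cycle the unloaded latency is 70 ns; halving the cycle time to 5 ns gives 35 ns. Sixty-four nodes at 100 MB/s each give the 6,400 MB/s peak figure.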






11. METRO Link



METRO Link (MLINK) is a network interface designed to connect the processor and memory on
an MBTA node to a METRO network. MLINK handles the core portions of MRP-ENDPOINT (See
Section 4.7) and provides support so the node processor can handle the remainder.

11.1 MLINK Function

MLINK performs all of the low-level operations necessary for an endpoint to send and receive
data over a METRO network. MLINK handles control and signalling which must operate at the
network speed. It also handles those operations which must be implemented in hardware to exploit
the full bandwidth of the network ports and keep end-to-end network latency low. MLINK leaves
infrequent diagnostic operations, certain kinds of message formatting, and policy decisions to the
node processor.

MLINK's primary function is to convey data between the network and the node memory. MLINK
moves data between the double-word wide memory bus, on which it gets one cycle once every
80 ns, and the byte-wide network port operating at 100 MHz. MLINK adds control bytes to the
data stream (e.g. ROUTE, TURN, DROP) to open, reverse, and close network connections. MLINK
also generates and verifies the end-to-end message checksums used to guard message transmission.
MLINK will retry failed connections without processor intervention up to some processor-specified
number of trials. A pair of MLINK network inputs will arbitrate using randomization to determine
which input is used for each connection. MLINK network outputs can handle the reception of
a small set of primitive messages (See Section 11.3) without processor intervention. For all other
messages, MLINK queues the incoming data to be handled by the processor.

Operations which are more complicated and infrequent are left to the node processor. The
processor is responsible for packet launch and for source-queuing of messages while the network
input is busy. The processor determines how to proceed when MLINK fails to deliver a message in
the specified number of trials. The processor is also responsible for allocating space for incoming
remote function invocation messages and for processing and dequeuing the messages as they arrive.
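
The division of labor for retries can be sketched in software. This is an illustrative model, not MLINK's hardware logic: `send_with_retry` and `try_connection` are hypothetical names, and the random choice between two network inputs stands in for the randomized arbitration described above.

```python
import random


def send_with_retry(try_connection, max_trials, rng):
    """Attempt a connection up to max_trials times, picking a network input at random.

    try_connection(input_index) -> bool reports whether the message was delivered
    and acknowledged. Returns (delivered, trials_used). On failure the caller,
    playing the role of the node processor, decides how to proceed.
    """
    for trial in range(1, max_trials + 1):
        input_index = rng.randrange(2)   # randomized choice between the two inputs
        if try_connection(input_index):
            return True, trial
    return False, max_trials


# Usage: input 0 is modeled as faulty, so only attempts on input 1 succeed.
delivered, trials = send_with_retry(lambda idx: idx == 1, 32, random.Random())
```

Because each attempt chooses its input independently, a fault on one input is avoided with high probability after a few trials, which is the reason the retry limit can be left as a processor-specified policy.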

When configured to do so, MLINK will store connection status information for successful and
failed connections. This information includes the status and checksum words returned from each
router in an allocated path. MLINK leaves the task of interpreting this information to the processor.

A few tasks are also left to the processor in order to limit how much MLINK needs to know about
the message protocol or attached network. The remote function invocation protocol offers a
flexible opportunity to customize low-overhead messages for a particular application. MLINK only
provides basic transport and queuing of these message types, leaving formatting and interpretation
to the processor. MLINK also leaves the selection of routing words to the processor. This allows
the processor to select a particular path in extra-stage, multipath networks (See Section 4.7.1) and
prevents MLINK from needing to know the details of formatting routing words for any particular
network (See Section 4.6).
11.2 Interfaces

On the node side, MLINK connects to the 64-bit, pipelined bus. This bus serves two purposes for
each MLINK interface. Recall from Chapter 10 that each network interface on an MBTA node has
a designated cycle on the pipelined bus once every 80 ns. MLINK uses this slot to read or write data
from the fast memory at 100 MB/s. The processor also has a designated slot on the pipelined bus.
During the processor slot, it may read or write 32-bit values at memory-mapped MLINK addresses.
These memory-mapped addresses allow the processor to:

1. configure each MLINK 

2. launch or abort network operations

3. check on the status of each MLINK's ongoing or recently completed operations

On the network side, MLINK has a byte-wide network port which behaves like a forward
or backward port. The network port has the same configuration options as each METRO
forward and backward port (See Table 9.2).

11.3 Primitive Network Operations 

MLINK distinguishes five kinds of primitive network operations:

1. READ 

2. WRITE 

3. RESET 

4. NOOP 

5. ROP (remote function invocation)

The RESET, NOOP, READ, and WRITE operations are handled entirely by the receiving MLINK without
involving the processor, whereas the remote function invocation is only queued by MLINK to be
handled by the node processor. The READ operation performs a multi-word, memory read operation
on the remote node, returning the data at the specified address to the source. The WRITE operation
performs the complementary function, allowing data to be written into a remote node's memory.
These operations are direct, hardware reads and writes and are associated with no guards for
coherence. The RESET operation signals MLINK to release the associated processor and allow it to
boot. Combined with the WRITE operation, this primitive allows each node to be remotely configured
and booted across the network. The NOOP operation performs no function on the destination node
but does return connection status information which is useful during testing.

Remote function invocation is a generic primitive which allows software to define
arbitrary message types and remote network functions. MLINK simply conveys the specified data
and a distinguished address from the source endpoint to the destination via the network. The
destination MLINK queues the arriving data and address on the incoming message queue for the
MLINK Message Formats:

(ROUTE)* ∘ RESET ∘ (DATA_cksum)^2 ∘ TURN

(ROUTE)* ∘ NOOP ∘ (DATA_cksum)^2 ∘ TURN

(ROUTE)* ∘ READ ∘ len ∘ (DATA_addr)^3 ∘ (DATA_cksum)^2 ∘ TURN

(ROUTE)* ∘ WRITE ∘ len ∘ (DATA_addr)^3 ∘ (DATA_cksum)^2 ∘ (DATA_write)^(8·len) ∘ (DATA_cksum)^2 ∘ TURN

(ROUTE)* ∘ ROP ∘ len ∘ (DATA_addr)^3 ∘ (DATA_cksum)^2 ∘ (DATA)^(8·len) ∘ (DATA_cksum)^2 ∘ TURN

MLINK Reply Formats:

(DATA_mlink_status)^2 ∘ ACK/NACK ∘ DROP

(DATA_mlink_status)^2 ∘ (DATA_IDLE)* ∘ (DATA_read)^(8·len) ∘ (DATA_cksum)^2 ∘ DROP

Each primitive message type has its own initial message format. Where determined by
MLINK, superscripts indicate the number of bytes composing each portion of the message
data. The read operation is the only primitive message which receives data along with its
reply message.

Figure 11.1: MLINK Message Formats

destination processor to service. The destination processor dequeues each message and invokes the
function at the specified address with the associated data. Von Eicken et al. call this kind of
low-overhead, remote code invocation an Active Message. This primitive exports the basic functionality of the
network to the software level where custom message handlers can be crafted in software for each
application or run-time system.
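
The queue-then-dispatch behavior of remote function invocation can be modeled in a few lines. This is a software sketch under stated assumptions, not the MBTA node implementation: the class and method names are hypothetical, and a Python dictionary keyed by address stands in for code located at the distinguished address.

```python
from collections import deque


class NodeModel:
    """Toy model of a destination node servicing remote function invocations."""

    def __init__(self):
        self.incoming = deque()   # incoming message queue filled by the network interface
        self.handlers = {}        # address -> handler, standing in for code at that address

    def register(self, address, handler):
        self.handlers[address] = handler

    def enqueue(self, address, data):
        """Role of the receiving interface: queue the message without interpreting it."""
        self.incoming.append((address, data))

    def service(self):
        """Role of the node processor: dequeue each message and invoke its handler."""
        results = []
        while self.incoming:
            address, data = self.incoming.popleft()
            results.append(self.handlers[address](data))
        return results
```

The interface never inspects the payload; all interpretation happens in the handler, which is what lets each application or run-time system define its own message types.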

Figure 11.1 summarizes the message formats used by MLINK network interfaces to perform the
primitive network operations. Where appropriate, the target address and data length are guarded
with their own checksum so that data can be written into memory while it is being received (See
Section 4.7.4). In reply to a WRITE, RESET, NOOP, or ROP message, MLINK sends status information
and an acknowledgement. When replying to an uncorrupted READ operation, MLINK returns the data
associated with the read. The DATA-IDLE words preceding the read data in the read reply message
are used to fill in any delays before the first byte of read data is available. This delay arises partially
from the need to wait for the first read data on the node data bus and partially from the need to
fill pipeline stages within MLINK with reply data.
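
The request formats in Figure 11.1 determine how many bytes each primitive occupies on the byte-wide channel. The sketch below counts them; it assumes that each ROUTE word, the operation code, the len field, and TURN occupy one byte each, which the figure does not state explicitly, and the function name is hypothetical.

```python
CKSUM_BYTES = 2   # each checksum field in Figure 11.1
ADDR_BYTES = 3    # address field


def request_length(op, route_words, length=0):
    """Bytes in a request message, following the formats in Figure 11.1.

    Assumption of this sketch: ROUTE words, the opcode, len, and TURN are
    one byte each on the byte-wide channel.
    """
    header = route_words + 1                     # ROUTE words plus the opcode
    turn = 1
    if op in ("RESET", "NOOP"):
        return header + CKSUM_BYTES + turn
    if op == "READ":
        return header + 1 + ADDR_BYTES + CKSUM_BYTES + turn
    if op in ("WRITE", "ROP"):
        payload = 8 * length                     # len counts 8-byte double-words
        return header + 1 + ADDR_BYTES + CKSUM_BYTES + payload + CKSUM_BYTES + turn
    raise ValueError("unknown operation: " + op)
```

Under these assumptions, a WRITE of two double-words through a three-stage network occupies 29 bytes, of which 16 are payload.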

These primitives form a minimal set of network primitives. They provide a high degree of
flexibility for a general-purpose multiprocessor. In situations where the application and programming
model are limited or biased to a particular domain, it may make sense to customize a network
interface with additional network primitives implemented directly in hardware without the processor's
intervention. For instance, when building a dedicated, shared-memory machine with a particular
memory-model in mind, it would be beneficial to provide primitive network operations to handle
coherent memory transactions.
12. MBTA Packaging



In Section 7.4, we showed how to package a network using the stack packaging technology
introduced in Chapter 7. Here, we consider packaging an entire 64-processor MBTA machine.

12.1 Network Packaging 

Packaging the network is a simple application of the network to stack packaging mapping
introduced in Section 7.4. As shown in Figure 10.1, a 64-processor network built out of RN1
components, or comparably sized METRO components, has 16 routers in each stage. We arrange
these routers in a 4 × 4 grid arrangement as shown in Figure 12.1. With the routers packaged in
DSPGA packages and placed on 3 inch centers, each routing board is roughly 12 inches square.

12.2 Node Packaging 

We can package the nodes inside the same stack by housing the larger, VLSI components in
DSPGA packages and using gull-wing surface-mount components for memory and bus logic. By
sharing the i/o pads associated with the 64-bit, node data bus, we can integrate all four logical
network interfaces on a single die and place the die in a DSPGA package. The processor and
custom bus control logic can each be placed in their own DSPGA package. The memory can be
obtained in gull-wing, surface-mount packages. The bus interface logic can be packaged in SSOP
packages with a 25 mil pad pitch [Tex91]. By adding a fourth DSPGA package, we can package a
node on a 6 inch square PCB with the DSPGA components centered 3 inches apart. The memory
and glue logic can be placed on the surface of the node PCB between the DSPGA packages as
shown in Figure 12.2. The fourth DSPGA package can be used either to house additional node
logic or as a blank for mechanical support and vertical signal continuity. This arrangement allows
us to stack four node PCBs on top of the network routing boards and align the DSPGA packages in
the network and nodes (See Figures 12.3 and 12.1).

To accommodate additional logic or memory for each node, we can build daughter boards of
the same size and use vertical connectivity to interconnect the boards. As long as the signals which
connect to the additional logic or memory are available on the pads of one of the four DSPGA
packages on the node, an adjacent PCB has access to these signals. A DRAM memory card, for
example, could be built by housing the DRAM controller in a DSPGA package, and packaging the
DRAM in TSOP packages between the DSPGA component sites. Blank DSPGA packages will be
necessary in any unused DSPGA grid sites.

12.3 Signal Connectivity 

Each node needs to be connected to two network input channels and two network output
channels. We use the through vias on the DSPGA packages to vertically connect each node into
The 16 routers in each stage of the network (See Figure 10.1) are arranged in a 4 × 4 grid.
The routers are housed in DSPGA packages and spaced 3 inches apart, giving a board roughly
12" on a side.

Figure 12.1: Routing Board Arrangement for 64-processor Machine
By housing the processor, network interface, and bus control logic in DSPGA packages,
we can construct a 6 inch square node suitable for stack packaging. The fourth DSPGA
can be blank or house additional logic, such as an optional co-processor. Memory and bus
interface logic are housed in gull-wing surface-mount packages in the space between the
DSPGA packages.

Figure 12.2: Packaged MBTA Node
Four nodes can be arranged in one 12 inch square stack layer which mates mechanically
and electrically with the routing component layers (Figure 12.1).

Figure 12.3: Layer of Packaged Nodes
the network layers. As shown in Figure 12.3, we can place four nodes in each stack layer above,
or below, the group of three network boards. To house 64 nodes, we need 16 such layers
of nodes. We can place half of the nodes above and half below the network layer to minimize the
distance of any node from the network. This segregation leaves us with eight layers of nodes on
each side of the network. The vertical through signals on each node must run network connections
for the eight nodes in each node column on each side of the network. Each of the eight nodes taps
off the appropriate subset of these signals to connect into the network. Mechanically, the node
arrangement described has a rotational symmetry of four. With proper signal arrangement, we can
exploit this symmetry to allow a single node PCB design to tap into any of four different vertical
signal runs. We can tap into the eight different vertical signal runs using only two different basic
node designs. In the network layers, the vertical through interconnect will be used to arrange the
network inputs and outputs so that half of the inputs and half of the outputs are available on each
side of the network.
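
The layer counts in this chapter follow from the grid dimensions. A minimal sketch of the arithmetic, with illustrative names and with the board edge taken as grid size times package pitch (an approximation consistent with the "roughly 12 inches" figure):

```python
def stack_layout(nodes=64, nodes_per_layer=4, grid=4, pitch_in=3,
                 designs=2, rotations=4):
    """Return board edge (inches), node layers, layers per side, and distinct tap positions."""
    board_edge_in = grid * pitch_in            # 4 x 4 routers on 3 inch centers
    node_layers = nodes // nodes_per_layer     # four 6 inch nodes per 12 inch layer
    layers_per_side = node_layers // 2         # half above and half below the network
    tap_positions = designs * rotations        # PCB designs times rotational symmetry
    return board_edge_in, node_layers, layers_per_side, tap_positions
```

Two node PCB designs, each usable in four rotations, give eight distinct tap positions, one for each of the eight nodes in a column on one side of the network.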

12.4 Assembled Stack 

Figure 12.4 shows an exploded view of the packaged 64-processor machine. External interfaces
mate with the nodes in the top-most node layer using the same vertical interconnect scheme
suggested for node daughter boards. The complete stack houses the network and all 64 nodes in a
cubic structure roughly 12" × 12" × 5" (See Figure 12.5).
A complete 64-processor machine stack is composed of 3 network layers (Figure 12.1) and
16 node layers (Figure 12.3). Two different node PCB designs, coupled with the rotational
symmetry of the node PCBs, allow each of the eight nodes in a vertical column to tap into
different network connections.

Figure 12.4: Exploded Side View of 64-processor Machine Stack
Scale Drawing

Figure 12.5: Side View of 64-processor Machine Stack
13. Summary and Conclusion



We have examined the latency and fault tolerance associated with large, multiprocessor networks.
We developed techniques at many levels for building low-latency networks. We also developed
networks capable of sustaining faults and techniques allowing proper operation of these networks
in the presence of faults. In the development, we found no inherent incompatibilities between our
goals of low latency and fault tolerance. Rather, we found commonality between techniques which
decrease latency and those which improve fault tolerance.

Consequently, we were able to identify a rich class of networks with good latency and
fault-tolerant characteristics. We parameterized the networks in this class in several ways. We developed
an understanding of how the network parameters affect network properties. This understanding
allows us to tailor networks to meet the requirements of particular applications.



13.1 Latency Review 



Combining the latency contributions from Section 2.4 and collapsing into a single equation, we
We see that there are many aspects which contribute to network latency. To achieve low latency,
we must pay attention to all potential latency contributors and work to simultaneously minimize
their effects. In Chapters 3 through 7, we addressed all of these latency components and examined
ways to minimize their contributions.

We considered how to minimize the transit time between routers, $T_t$.

\[ T_t = \frac{d}{v} \tag{13.2} \]

This latency is determined by the speed of propagation, $v$, and the total distance traversed, $d$.

We saw that the maximum speed of signal propagation was determined by material properties.

\[ v = \frac{1}{\sqrt{\mu\epsilon}} \]

We also saw that this maximum was only achievable with proper signal termination. In Chapter 6,
we saw signalling techniques for achieving this maximum rate of propagation with minimal power
dissipation. We saw that the traversed interconnect distance depends on the growth
characteristics of the network topology and the achievable packaging density. In Chapter 3, we looked at
the interconnect distance growth characteristics for a large class of networks and identified those
with the most favorable growth characteristics. In Section 2.4.2, we noted that, in some situations,
locality can be exploited to minimize, on average, the distances which must be traveled inside
the network. Consequently, in Chapter 3 we also identified network topologies with locality. In
Chapter 7, we looked at technologies for high-density packaging. We also looked at topologies for
mapping networks onto the packaging technology in a way that exploits the density to minimize
interconnection distances.

We considered how to minimize the total number of routers which must be traversed in a
network. In Chapter 3, we found that log structured sorting networks gave us the lowest number
of switches as long as we restricted ourselves to bounded degree switching nodes (Section 2.7.1).
We also noted that the router radix, r, gives us a parameter we can use to control the actual number
of switches traversed in an implementation. Again, for some applications locality exploitation
may allow us to further reduce the average number of switches traversed when routing through the
network.

We noted that the latency contributed by each routing component was composed from the
switching time and the i/o latency.

\[ T_{nl} = T_{io} + T_{switch} \tag{13.3} \]

In Chapter 6, we identified a signalling discipline which minimized transit and chip i/o latencies. We
looked at technologies for implementing CMOS drivers and receivers for this signalling discipline
and saw how to design circuitry for realizing low-latency i/o. In Chapter 4, we developed a
simple routing protocol that was well matched to the capabilities of CMOS implementation
technologies. The routing scheme combines simple, local decision making with a minimum
complexity routing protocol to allow the switch to perform all of its functions quickly.

To keep the contention latency and the transmission time, $T_{transmit}$, low, we looked at how
to provide high bandwidth in these networks. We saw that increasing the bandwidth available
for each connection will decrease the transmission latency.
We can increase this bandwidth either by increasing the signalling rate or by increasing the data
channel width, w. We can also note that the lower the transmission latency, $T_{transmit}$, the faster
the resources used by a connection are freed. As a result, decreasing transmission latency will also
decrease contention latency.
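
The dependence of transmission time on channel width and signalling rate can be illustrated numerically. The relation used below, message bits divided by the product of channel width and clock frequency, is an assumption of this sketch rather than the document's own equation, and the function name and message size are illustrative.

```python
def transmit_time_ns(message_bytes, width_bits, clock_mhz):
    """Time to stream a message over a channel moving width_bits per clock cycle.

    Assumed model: T_transmit = (message bits) / (width_bits * clock frequency).
    """
    # width_bits * clock_mhz is the bandwidth in Mbit/s; the factor 1000 converts to ns.
    return message_bytes * 8 * 1000.0 / (width_bits * clock_mhz)
```

Under this model a 24-byte message on a byte-wide, 100 MHz channel takes 240 ns, and doubling either the width or the clock frequency halves that time.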

We noticed that we can often reliably send data faster than the data can traverse wires or routing
components. As a result, we saw that pipelining the transmission of data often allows us to decrease
the signalling clock period considerably, and hence increase bandwidth, without any negative impact
on latency. To this end, we showed how the routing protocol can accommodate pipelining of data
across wires and inside routers (Section 4.11). In Chapter 6, we also saw how the i/o circuitry and
signalling discipline allow us to reliably pipeline bits across wires of arbitrary length.

Further, we saw that contention latency arises from inadequate or improperly utilized resources
inside the network. We saw in Chapter 3 that dilated routers gave connections a choice of resources
to utilize throughout the network. This freedom reduced the likelihood that blocking will occur
within the network and hence reduced contention latency. In Section 4.9.2, we saw how fast path
collapsing reduced contention latency further by quickly reclaiming resources allocated to blocked
connections.
In Chapter 3, we also saw that we have several options for reducing contention latency:

1. We can increase the per connection bandwidth by increasing the channel width or signalling
frequency, as described above.

2. We can also increase the aggregate bandwidth of the machine by increasing the number of
input and output connections between each node and the network, ni and no.

3. We can decrease the likelihood of blocking by increasing the dilation, d, so that each
connection has more options at each routing stage.

13.2 Fault Tolerance Review 

In Section 2.5, we saw that we cannot depend on the correct operation of every component in
the network if we need to achieve reasonable MTTF for large-scale multiprocessor networks. We
also saw that the ability to operate in the presence of even a small number of faulty components
improves our system reliability considerably. This observation led us to look for networks in which
we could maximize the distinct resources available to make any connections and hence to minimize
the likelihood that any set of faults will render the network dysfunctional.

In Section 2.1.1, we noted that transient faults were much more likely than permanent faults.
This fact, coupled with the single-component fault rate derived in Section 2.5, led us to be concerned
with robust operation in the face of dynamically arising faults. We found that we must devise
protocols which do not assume the correct operation of any component in the network at any point
in time. Rather, we must arrange the protocol to verify the integrity of each network operation.

In Section 3.3 we examined multipath networks and noted their potential for providing fault
tolerance. In these networks, the multiple paths between endpoints use different routing resources.
These alternate routing resources provide the basis for fault-tolerant operation. When a faulty
component renders one path inoperative, another path is available which avoids the faulty
component. In Section 3.5 we examined many of the detailed wiring issues associated with multipath,
multistage networks. We saw that the number of connections to each endpoint is the
weakest link between a node and a multipath network, and we saw how to make the best use of
the endpoint connections available in a particular network. We also visited the issue of wiring the
multiple paths inside the network to maximize fault tolerance. We saw different evaluation criteria
based on whether network connectivity is viewed as a yield problem or as a harvest problem. If we
allow node isolation, we saw that randomly-wired networks generally behave most robustly in the
face of faults. When node isolation is not permitted, we found that deterministic, maximum-fanout
networks generally survive more faults.

We also saw that we can control the amount of redundancy, and hence fault tolerance, in these
networks by selecting the router dilation, d, and the number of node input and output connections,
ni and no. We can adjust ni and no to control the mean time to node isolation. For the harvest
case, where node isolation is not allowed, increasing the number of node inputs and outputs is
probably the most effective way of increasing fault tolerance. We can adjust the dilation to control
the amount of path fanout within the network and hence the number of paths provided between
endpoints. Increasing the dilation is effective for increasing fault tolerance in both the harvest
and yield situations. However, due to the different reliability metrics we use in these two cases,
increasing the dilation is much more effective in the yield case than in the harvest case.

Multipath network topology, however, only gave us the potential for fault-tolerant operation. To
realize that potential, we noted that the routing scheme must be able to detect when failures occur and
be able to exploit the multiple paths to avoid faults. In Chapter 4, we developed such a scheme for
routing on the multistage, multipath networks detailed in Chapter 3. End-to-end message checksums
guard each data transmission against unnoticed corruption. End-to-end acknowledgments and
source-responsible retry work together to guarantee each message is delivered at least once without
corruption. Random selection of a particular path through the multipath network, coupled with
source-responsible retry, guarantees that any non-faulty path between any source-destination pair
can eventually be found. Combining these features, the routing protocol achieves correct operation
without requiring any knowledge of the faults within the network.
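
The interaction of checksums, acknowledgments, random path selection, and source-responsible
retry can be illustrated with a small Python sketch. This is only a toy model under stated
assumptions, not the protocol of Chapter 4: the names send_with_retry, transmit, and
make_faulty_network are hypothetical, CRC-32 stands in for whatever checksum the routers use,
and a faulty path is modeled as one that always corrupts the first byte of the message.

```python
import random
import zlib


def send_with_retry(payload, paths, transmit, max_attempts=1000, rng=random):
    """Deliver payload over one of `paths`, retrying until it is acknowledged.

    `transmit(path, payload, checksum)` is a hypothetical callback that returns
    the checksum acknowledged by the destination, or None on failure.
    """
    checksum = zlib.crc32(payload)            # end-to-end message checksum
    for attempt in range(1, max_attempts + 1):
        path = rng.choice(paths)              # random path selection
        ack = transmit(path, payload, checksum)
        if ack == checksum:                   # end-to-end acknowledgment matches
            return path, attempt
        # No acknowledgment (or a bad one): the source is responsible for retrying.
    raise RuntimeError("no working path found within max_attempts")


def make_faulty_network(bad_paths):
    """Toy network: every path in `bad_paths` corrupts the message it carries."""
    def transmit(path, payload, checksum):
        if path in bad_paths:
            received = bytes([payload[0] ^ 0xFF]) + payload[1:]
        else:
            received = payload
        # The destination recomputes the checksum and acknowledges only a match.
        return checksum if zlib.crc32(received) == checksum else None
    return transmit
```

Note that the source needs no knowledge of which paths are faulty: with three of four paths
corrupting data, repeated random selection still ends on the one good path.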

We saw that we could minimize the performance impact of faulty components and interconnect
on the network by identifying them and masking them from the network. A known, masked fault
is deterministically avoided. This avoidance allows the random path selection to converge more
quickly on a good path by removing all known bad paths from the space of potential paths. We
also saw that identifying faults allows us to make assessments about the integrity of the network
(Section 5.1, Section 5.7).
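
To make the convergence argument concrete, the following Python sketch estimates the expected
number of connection attempts before a good path is chosen. It is a simplified model under
assumptions that are ours, not the text's: each attempt picks uniformly and independently among
the unmasked paths, a faulty path always fails, a good path always succeeds, and congestion is
ignored. The function name expected_attempts is hypothetical.

```python
from fractions import Fraction


def expected_attempts(total_paths, faulty_paths, masked_paths=0):
    """Expected number of random tries until a good path is selected.

    masked_paths counts known faulty paths removed from the candidate set.
    With uniform, independent selection the number of tries is geometric,
    so its mean is (candidate paths) / (good paths).
    """
    if not 0 <= masked_paths <= faulty_paths <= total_paths:
        raise ValueError("need 0 <= masked <= faulty <= total")
    good = total_paths - faulty_paths
    if good == 0:
        raise ValueError("no good path exists")
    candidates = total_paths - masked_paths
    return Fraction(candidates, good)
```

For example, with 8 paths of which 3 are faulty, the model gives 8/5 expected attempts with no
masking and exactly 1 once all three faults are masked.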

We developed minimally intrusive mechanisms for locating faults. The routing protocol uses
the pipeline delay cycles associated with reversing the direction of data flow across the network to
transmit router checksums and detailed connection information back to the source. This information
helps narrow down the source of any faults. Port-by-port deselection and partial-external scan
(Chapter 5) allow the system to isolate regions of the network and test for faults. Since the network
has redundant paths, portions of the network can be isolated and tested in this manner without
interfering significantly with normal operation.

Finally, we saw that the mechanisms used for fault isolation and testing, coupled with the
multiple paths within each network, provide facilities for in-operation repair. Physically replaceable
subunits can be isolated, repaired, and returned to service without taking the entire network out
of service (Section 5.6, Section 7.5.6). On-line repair allows us to minimize or eliminate system
down-time and hence maximize system availability.

13.3 Integrated Solutions 

We have described a set of techniques for building robust, low-latency multiprocessor networks.
These solutions span a range of implementation levels from VLSI circuits, packaging, and
interconnect up through architectures and organizations. Each technique presented is interesting in
its own right for the features and benefits it offers. However, the collection of techniques presented
here is most interesting because the techniques integrate smoothly into a complete system. When
assembled, we do get a system which reaps the cumulative benefits offered by all of the techniques.
The features of many of the techniques complement each other such that the overall features and
benefits of the composite system are greater than the features of the individual pieces.

A. Performance Simulations



In this appendix, we will describe the simulations used to measure network performance. We will
begin by describing some features of the basic architecture modeled. We revisit some practical
issues of network construction involving the inputs and outputs of the network. Finally, we develop
methods for exercising networks based on representative network loads taken from shared memory
applications.

A.1 The Simulated Architecture

The networks simulated use a circuit-switched routing component based upon RN1 (see Chapter 8)
and Metro (see Chapter 9). The particular component used throughout these experiments
can act either as a single 8-input, radix-4, dilation-2 router, or as two independent 4-input, radix-4,
dilation-1 routers.

To aid the routing of messages, each component will have a pin dedicated to calculating
control information according to the following blocking criterion taken from Leighton and Maggs
in [LM92]. A router is blocked if it does not have at least one unused, operational output port in
each logical direction which leads to a router which is not blocked. To route a message, a router
attempts to choose a single output port by first looking at unused ports, and second eliminating any
ports which are blocked. If there is no unique choice (multiple ports unused, but all unblocked or all
blocked), then the router randomly decides between ports.
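
The blocking criterion and the port-selection rule can be sketched in a few lines of Python.
This is an illustrative model only, not the RN1 or Metro logic: the Port record, its field
names, and the functions router_is_blocked and choose_output_port are hypothetical, and we
assume that a non-operational port is treated the same way as a port that is in use.

```python
import random
from dataclasses import dataclass


@dataclass
class Port:
    port_id: int
    in_use: bool = False
    operational: bool = True
    blocked: bool = False      # True if the router this port leads to is blocked


def is_usable(port):
    """A port can carry a new connection if it is working and not in use."""
    return port.operational and not port.in_use


def router_is_blocked(directions):
    """directions: one list of Port objects per logical direction.

    The router is blocked unless every direction has at least one unused,
    operational port leading to a router which is not itself blocked.
    """
    return not all(
        any(is_usable(p) and not p.blocked for p in ports)
        for ports in directions
    )


def choose_output_port(ports, rng=random):
    """Pick an output port in one logical direction, or None if none is free."""
    unused = [p for p in ports if is_usable(p)]
    if not unused:
        return None                       # the connection cannot be made here
    unblocked = [p for p in unused if not p.blocked]
    # Prefer unblocked ports; if every unused port is blocked, keep them all.
    candidates = unblocked if unblocked else unused
    return rng.choice(candidates).port_id  # random choice among equals
```

With a dilation-2 component each logical direction holds two ports, so the random choice only
arises when both are unused and are either both unblocked or both blocked.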

Each chip also incorporates a serial test-access port (TAP). These ports, in turn,
are connected together in a diagnostic network which can provide in-operation diagnostics and chip
reconfiguration as detailed in Chapter 5.

A.2 Coping with Network I/O 

Much of the fault tolerance and routing behavior of our networks is dominated by the first and the
last stages. Although multipath networks provide multiple paths between any two nodes, these
paths can only use a large number of physically distinct routers towards the middle of the network.
Near the nodes, these paths must concentrate towards their sources and destinations. This
concentration is most severe in the first and last stages, where each node only has a small number
of connections to the network (see Section 3.5.2). For the sake of these simulations we assume
dilation-one routing components are used in the final stage of the network (Figure 3.11).



This information is reprinted with slight modification from [Cho92].

A.3 Network Loading

In this section, we derive network loads for use with our performance simulations. We need to
run a large number of simulations to obtain average performance of each network at various fault
levels. Consequently, we use simple synthetic loadings to keep simulation time manageable. We
start by using uniformly distributed random destinations for our messages, then refine our model by
looking at shared-memory applications studied by the MIT Alewife Project [CFKA90].

A.3.1 Modeling Shared-Memory Applications 

Our goal is to provide a realistic model of network utilization that can be used to compare
many different networks and parameters. To keep simulation time tractable, we use a variant of
uniform traffic. Our simulation sends messages to random destinations in a uniform distribution.
However, message lengths are randomly generated according to distributions derived from specific
parallel applications. These applications were taken from caching studies done by the Alewife
Project [CFKA90]. These studies simulate a shared memory architecture with coherent caches at
each processing node. Data taken from this study corresponds to the following system parameters:

• Shared memory, coherent caches

• Full-map directories

• 16-byte cache lines

• 64 nodes, corresponding to 3-stage, radix-4 networks

• Single thread

• CISC instructions

• 1 memory reference per instruction

• Processors stall after 1 outstanding memory reference

• Barrier synchronization

A.3.2 Application Descriptions 

[CFKA90] studied four applications: SIMPLE, SPEECH, FFT, and WEATHER. SIMPLE models
the hydrodynamic behavior of fluids using finite difference methods to solve the equations in
two dimensions. SPEECH is the lexical decoding stage of a phonetically-based spoken language
understanding system. It uses a variant of the Viterbi search algorithm. FFT is a radix-2 Fast Fourier
Transform. WEATHER uses finite-difference methods to solve partial differential equations which
model the atmosphere around the globe.

These applications attempt to represent three major classes of problems: graph problems,
continuum problems, and particle problems. Graph problems involve searching and irregular
communication. Continuum problems generally have localized communication in regular patterns.
Particle problems often involve communication over long distances to simulate interactions such
as those due to gravitational forces.

A.3.3 Application Data

Cache-coherent shared memory systems utilize a small set of message types, each with a fixed
length. The messages sent through the network read cache lines, write cache lines, and maintain
cache coherency. A cache-line read, for example, requires an 8-byte read request followed by a
reply containing the data. The reply consists of an 8-byte header followed by 16 bytes representing
the desired cache line.

Table A.1 lists the frequency of all transactions for our four applications. The messages sent
for each transaction are also listed. In contrast to the Alewife study, it is important for us to
distinguish which transactions are split phase. Our networks are circuit-switched and can save
routing time if a reply to a request is immediately available. However, if the transaction must be
split into two phases, two messages will have to be routed separately. Table A.1 distinguishes those
messages which are single phase, always split phase, and sometimes split phase. Table A.2 gives
the percentage split phase for those which are sometimes split phase. Assuming two router cycles
per processor cycle, our data gives us approximate message generation rates of 3, 6, 7, and 9 percent
per router cycle for WEATHER, SIMPLE, SPEECH, and FFT, respectively.

Table A.2 also gives the approximate grain sizes of each application. These figures will be
used to determine frequency of barrier synchronization, to be discussed in Section A.3.4. Finally,
Table A.3 gives the relative frequency of each length of message for each application. This is
summarized by the average length of messages given for each application.
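
The summary figures can be cross-checked against the entries of Table A.3 with a short Python
sketch. The interpretation used here is an assumption on our part: we read each table entry as
messages per 100 processor cycles, which, at two router cycles per processor cycle, reproduces
the 3, 6, 7, and 9 percent rates quoted above. The names MESSAGE_FREQ, average_length, and
rate_percent_per_router_cycle are hypothetical.

```python
# Table A.3 entries as printed: message length in bytes -> relative frequency.
MESSAGE_FREQ = {
    "WEATHER": {8: 2.9211, 16: 0.5112, 24: 1.1207, 32: 2.4359},
    "SIMPLE":  {8: 2.0798, 16: 1.2936, 24: 1.7055, 32: 5.9295},
    "SPEECH":  {8: 5.5800, 16: 2.7455, 24: 1.8890, 32: 3.3813},
    "FFT":     {8: 5.3179, 16: 2.9109, 24: 3.5784, 32: 5.6832},
}

ROUTER_CYCLES_PER_PROCESSOR_CYCLE = 2


def average_length(freq):
    """Frequency-weighted mean message length in bytes."""
    total = sum(freq.values())
    return sum(length * f for length, f in freq.items()) / total


def rate_percent_per_router_cycle(freq):
    """Messages per 100 router cycles, assuming entries are per 100 processor cycles."""
    return sum(freq.values()) / ROUTER_CYCLES_PER_PROCESSOR_CYCLE


for app, freq in MESSAGE_FREQ.items():
    print(app, round(rate_percent_per_router_cycle(freq)), round(average_length(freq), 4))
```

With the values as printed, the weighted averages for SIMPLE, SPEECH, and FFT agree with the
listed average lengths to within rounding. The WEATHER entries give an average near 19.5 bytes
rather than the listed 21.2178, which may indicate a transcription error in one of that column's
entries.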

A.3.4 Synchronization 

Our performance simulation includes barrier synchronization to eliminate skew. Our simulation
models applications which assume some degree of synchronization between the multiple processor
nodes of the system. When a simulation violates this assumption, results are skewed. We prevent
this skew by performing periodic barrier synchronization according to grain sizes estimated for
each application.

It is important to eliminate simulation skew because it can mask the effects of localized
network degradation. Analytic models suggest that faults and congestion may severely affect the
performance observed by specific destination nodes while leaving others largely unaffected [KR89].
Without synchronization, such localized degradation would be lost in average I/O bandwidth
utilization. Modeling synchronization, however, forces all processors to wait for those falling
behind, resulting in a more realistic decrease in I/O bandwidth utilization.
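
A toy calculation shows why averaging hides localized degradation. The Python sketch below is
not our simulator; it is a hypothetical two-function model in which each node sustains a fixed
message rate, and a barrier forces every node to wait until the slowest node has sent its share
of the grain. The function names and the example rates are assumptions chosen for illustration.

```python
def rate_without_barrier(node_rates):
    """Mean per-node message rate when nodes run freely, with no synchronization."""
    return sum(node_rates) / len(node_rates)


def rate_with_barrier(node_rates, messages_per_grain):
    """Effective per-node rate when all nodes synchronize after each grain.

    Every node sends messages_per_grain messages, then waits at the barrier,
    so the grain takes as long as the slowest node needs.
    """
    grain_time = max(messages_per_grain / rate for rate in node_rates)
    return messages_per_grain / grain_time
```

For 63 nodes sustaining 0.04 messages per cycle and one degraded node sustaining 0.01, the
unsynchronized average is about 0.0395, barely below nominal, while the barrier model drops every
node to 0.01.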

A.3.5 The FLAT24 Load

We also simulate a uniform message distribution, FLAT24. FLAT24 uses 24-byte messages
and other simulation parameters which are similar to the messages and parameters of
SPEECH. FLAT24 serves as a basis for network comparison.

For our later studies, we shall also use FLAT24, but we shall use parameters which differ slightly
from those in the Alewife study and correspond more closely to our target architectures.

Message Type                  Messages                          WEATHER   SIMPLE   SPEECH      FFT
read miss, write mode         [hdr,hdr+data] (hdr,hdr+data)      0.4400   0.4500   1.8700   0.9600
read miss, not write mode     (hdr,hdr+data)                     0.8400   4.2400   1.3500   1.7500
read miss, not in any cache   (hdr,hdr+data)                     0.6700   0.7700   0.0800   0.1300
write miss, write mode        [hdr,hdr+data] (hdr,hdr+data)      0.6400   0.6300   0.0100   2.5100
write miss, not write mode    ?hdr,hdr+data?                     0.0000   0.1900   0.0000   0.1000
write hit, not write mode     ?hdr,hdr?                          0.5500   0.4500   1.8500   0.9900
write miss, not in any cache  (hdr,hdr+data)                     0.3300   0.3000   0.1500   0.0000
instruction miss              (hdr,hdr+data)                     0.0700   0.1900   0.0000   0.2700
private misses                (hdr,hdr+data)                     0.1159   0.1038   0.0013   0.1821
invalidations                 (hdr,hdr)                          0.4318   1.2562   2.7455   2.8004
evictions                     (hdr,hdr)                          0.0000   0.0000   0.0000   0.0000
replacements of dirty data    (hdr+data)                         0.0407   0.4512   0.0090   0.0196
synchronizations not cached   (hdr,hdr+data)                     0.0000   0.0000   0.0000   0.0000

hdr = packet header, data = cache line
[...] = split phase message combination
(...) = single phase
?...? = sometimes split phase

Relative transaction frequencies for each of our four applications, in transactions per
processor cycle, are given above. Messages sent for each kind of transaction are also given.
(Data Courtesy of David Chaiken)

Table A.1: Relative Transaction Frequencies for Shared-Memory Applications

                      WEATHER         SIMPLE    SPEECH          FFT
percent split phase   85.5560         91.7210   100.0000        88.8396
grain size            59769, 187714   7271      1000 to 10000   28289

For each application, the percent split phase is given for those listed as sometimes split
phase (?...?) in Table A.1. Approximate grain sizes are also given for each application.
There are two grain sizes for WEATHER, one for each of two phases in the application.

Table A.2: Split Phase Transactions and Grain Sizes for Shared-Memory Applications

                  WEATHER    SIMPLE    SPEECH       FFT
8-byte             2.9211    2.0798    5.5800    5.3179
16-byte            0.5112    1.2936    2.7455    2.9109
24-byte            1.1207    1.7055    1.8890    3.5784
32-byte            2.4359    5.9295    3.3813    5.6832
Average Length    21.2178   24.3463   17.8074   20.4033

Relative frequencies of each length of message are given. These are summarized by the
average length of messages for each application.

Table A.3: Message Lengths for Shared-Memory Applications

To derive the frequency of message loading, several reasonable parameter values were chosen.
Note that our results are not overly sensitive to loading, so rough figures are adequate. The program
code for each processor is assumed to be resident in local memory. Consequently, only data
references will result in non-local memory references. We assumed a data cache miss rate of 15
percent. For each data read or write, we get 0.15 misses per processor cycle. We assumed that 50
percent of the references are to local memory, which gives us 0.075 references to non-local memory
per processor cycle. With a 50 MHz processor and the RN1 part running at better than 100 MHz,
we have two router cycles per processor cycle. This gives us 0.0375 non-local memory references
per router cycle. Adding an additional 10 percent to account for cache coherency messages, we
end up with approximately 0.04 messages per router cycle.
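
The arithmetic of this derivation can be written out as a short Python sketch. The function
name flat24_message_rate and its parameter names are hypothetical; the default values are the
ones assumed in the text, and one data reference per processor cycle is taken from the
Alewife parameters listed in Section A.3.1.

```python
def flat24_message_rate(miss_rate=0.15,
                        local_fraction=0.50,
                        router_cycles_per_processor_cycle=2,
                        coherency_overhead=0.10):
    """Messages per router cycle for the assumed load parameters."""
    misses_per_cycle = 1 * miss_rate                  # one data reference per cycle
    nonlocal_refs = misses_per_cycle * (1 - local_fraction)
    per_router_cycle = nonlocal_refs / router_cycles_per_processor_cycle
    return per_router_cycle * (1 + coherency_overhead)


print(flat24_message_rate())
```

The intermediate values are 0.075 non-local references per processor cycle and 0.0375 per
router cycle; the final figure is 0.04125, which the text rounds to 0.04.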

We also examine time between synchronizations, or, equivalently, application grain size. A
grain size representative of applications studied is 10,000 cycles, or 10,000 × 0.04 = 400 messages.
Altogether, our performance metric is the time to route the following task: all processors in the
system must each send 400 24-byte messages at a rate of 0.08 messages per active processor cycle.
We assume that each processor can have up to 4 threads, or tasks, each with an outstanding message,
before stalling.

Note that our task explicitly models barrier synchronization. Other styles of synchronization
may involve smaller processor groups, but may also tend to synchronize these groups more often.
In any case, it is important to model the synchronization requirements of an application. For our
purposes, barrier synchronization incorporates an appropriate component of these requirements
into our performance metric. We see that synchronization plays a major role in performance
degradation. This degradation occurs when network failure results in a small number of nodes with
particularly poor communication bandwidth.

Let us take one more look at our numbers. Since messages are 24 bytes long, we are basically
running our network at 1 byte per router cycle, or 100 percent. If the message rate were any
higher, the processors would just be stalled more often, and the network loading would not really
change. Note that our analysis assumes low-latency message handling, a concept demonstrated in
the J-Machine tB2]. If, as with many commercial and research machines, there exists a high
latency for message handling, the latency induces a feedback effect which prevents full utilization
of the network [Joh92]. Although cache miss and message locality numbers are open to debate, we
observe that the technological trends of multiple-issue processors and wider cache lines will only
increase demands on the network. However, performance results presented in this paper were also
verified to be qualitatively unchanged under network loading half of that used here.
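
As a quick check of these figures, the following Python sketch computes the grain size in
messages and the offered load per node. It assumes a channel capacity of one byte per router
cycle, which is our reading of "1 byte per router cycle, or 100 percent"; the constant and
function names are hypothetical.

```python
MESSAGE_BYTES = 24
MESSAGES_PER_ROUTER_CYCLE = 0.04
GRAIN_CYCLES = 10_000
CHANNEL_BYTES_PER_CYCLE = 1      # assumed capacity of a node's network connection


def messages_per_grain():
    """Messages each processor sends between barrier synchronizations."""
    return GRAIN_CYCLES * MESSAGES_PER_ROUTER_CYCLE


def offered_load():
    """Fraction of channel capacity requested by each node."""
    return MESSAGE_BYTES * MESSAGES_PER_ROUTER_CYCLE / CHANNEL_BYTES_PER_CYCLE
```

Under these assumptions each grain is 400 messages and the offered load is 0.96, that is,
essentially the full capacity of the channel.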

A.4 Performance Results for Applications 

In this section we summarize our performance results for each network in the presence of
faults. Performance was measured for complete networks only. For each fault level shown,
multiple trials were run in which faults were randomly chosen. After fault insertion, applications
were simulated on those networks which remained complete. Data shown represent the average I/O
bandwidth utilization and latencies over those trials involving complete networks. These results do
not represent the actual performance of these applications on hardware. Rather, our data provides
a basis for network comparison by illustrating performance trends in the presence of faults.
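
The trial procedure can be outlined as a Python sketch. It is a schematic of the methodology
rather than our simulator: the function average_over_complete_networks is hypothetical, and the
callbacks is_complete and simulate stand in for the connectivity test and the application
simulation, neither of which is specified here.

```python
import random


def average_over_complete_networks(components, fault_fraction, trials,
                                   is_complete, simulate, rng=random):
    """Average a performance figure over randomly faulted, still-complete networks.

    components  -- list of component identifiers that may fail
    is_complete -- callback: set of faulty components -> bool
    simulate    -- callback: set of faulty components -> performance figure
    Returns (mean performance or None, fraction of trials that stayed complete).
    """
    n_faults = round(fault_fraction * len(components))
    results = []
    for _ in range(trials):
        faults = set(rng.sample(components, n_faults))   # random fault insertion
        if is_complete(faults):                          # discard incomplete networks
            results.append(simulate(faults))
    survival = len(results) / trials
    mean = sum(results) / len(results) if results else None
    return mean, survival
```

Returning the survival fraction alongside the mean matters when comparing networks, since an
average taken only over surviving networks says nothing about how many trials were discarded.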

To isolate the effects of interwiring, the deterministic and random networks presented use
dilation-2 components in the last stage. We studied 3-stage non-interwired, randomly-interwired,
and deterministically-interwired networks. I/O bandwidth utilization varied by less than 2 percent,
a variation not significant for our simulation accuracy.

However, to avoid single-point disconnections in the last stage, the interwired networks we
shall analyze for fault performance are constructed from dilation-1 components in the last stage.
Although dilated components provide better performance, the use of these dilation-1 components
substantially increases fault tolerance (see Section 3.5.2). For applications studied on 3-stage
networks, the decrease in I/O bandwidth utilization was less than 6 percent.

Figures A.1 and A.2 detail the fault performance of our applications on 3-stage, radix-4
networks which can withstand faults. The FLAT24 load, shown as a solid line, is representative
of graph trends and will be used in network comparisons.

Figures A.3 and A.4 compare the performance of those networks which can tolerate faults.
The performance of the random network is slightly better than that of the deterministic network.
However, recall that figures are for complete networks only. For each fault level shown, the
random networks have a lower probability of remaining complete than the deterministic networks.

Figure A.1: Applications on 3-stage Random Networks

Figure A.3: Comparative Performance of 3-Stage Networks

... by random and deterministic networks are, respectively, 4.6% and 8.8%. Note that the
performance degradation appears to level off because only complete networks are measured.
Although the surviving networks suffer less degradation as percentage of failure increases,
the number of surviving networks is becoming substantially smaller.

Figure A.4: Comparative Performance of 4-Stage Networks